To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)

Content from smidgeo.com ranks 616,898th in that data set by token count. Weirdly, Miller Puckette's site, which is a treasure trove of actual information, ranks below it, at 864,434th. He is one of the world's foremost experts on computer music, and he has a classic textbook right there on his site in HTML! By volume, smidgeo.com is mostly bot-generated text, almost entirely non-factual.
We already knew these models are not good at facts. Absorbing tips like this probably doesn't help.