I was working today on what I'm calling a Knowledge System, and it struck me that GPT-3 has more knowledge embedded in it than any ten humans. One of the test cases I use is to ask my cognitive architecture about the future of nuclear fusion - a problem that requires predicting the future (and is therefore a good test of intelligence). Anyway, it occurred to me that GPT-3 already knows more about nuclear fusion than everyone except the experts. So what kind of data do you need to give GPT-3 to keep it honest?
The answer is: not much. GPT-3 only needs tiny nudges of hard facts and a sprinkling of current news to grasp a topic.
Most news, blogs, and Wikipedia articles are written with the layperson in mind - someone who needs to be reminded of the basic facts of a topic, like what a tokamak is. GPT-3 needs no such reminders, so it can benefit from tiny summaries of current events that keep it on track and help it extract profound insights.
This leads me to believe that, in the future, there will be a need for curated datasets. Certainly, sets like The Pile are for training, but I think a much lighter set will be needed for reference.
This is where technologies such as knowledge graphs could be extremely useful for GPT-3 chatbots and even AGI. All you need is quick access to verified facts to keep the system honest.
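To make the idea concrete, here's a minimal sketch of what "quick access to verified facts" could look like. The triples and the lookup scheme are hypothetical illustrations I made up for this post, not any particular knowledge-graph product:

```python
# Hypothetical sketch: store facts as (subject, predicate) -> object triples
# and check a model's claim against them before trusting it.
knowledge_graph = {
    ("ITER", "located_in"): "France",
    ("tokamak", "confines_plasma_with"): "magnetic fields",
}

def verify(subject: str, predicate: str, claimed_object: str) -> bool:
    """Return True only if the claim matches the stored fact."""
    fact = knowledge_graph.get((subject, predicate))
    return fact is not None and fact.lower() == claimed_object.lower()

print(verify("ITER", "located_in", "France"))  # matches the stored fact
print(verify("ITER", "located_in", "Spain"))   # contradicts the stored fact
```

A real system would need entity resolution and a much bigger graph, but the core loop - generate, look up, confirm or reject - stays this simple.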
Today I had some great success automatically distilling articles down for this purpose. There’s still some tweaking and fine-tuning to do, but tomorrow I plan to start testing the model to see whether it works as well on a broader set of problems. Functionally, these distilled versions of articles can be stored in a database and later used for question answering. By cutting the volume of reference text roughly 10:1, databases can be smaller, faster, and more efficient, which in turn makes for better AGI systems.
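The storage-and-retrieval half of that pipeline could be as simple as the sketch below: distilled summaries go into a table, and a topic query pulls back a compact block of reference text to prepend to a question-answering prompt. The summaries here are made-up placeholders, not real distilled output:

```python
import sqlite3

# Hypothetical sketch: keep distilled article summaries in SQLite and
# assemble the relevant ones into reference text for question answering.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summaries (topic TEXT, summary TEXT)")
conn.executemany(
    "INSERT INTO summaries VALUES (?, ?)",
    [
        ("fusion", "Private fusion startups raised record funding this year."),
        ("fusion", "A tokamak experiment sustained plasma longer than before."),
        ("batteries", "Solid-state cells moved closer to pilot production."),
    ],
)

def reference_text(topic: str) -> str:
    """Join a topic's stored summaries into one compact context block."""
    rows = conn.execute(
        "SELECT summary FROM summaries WHERE topic = ?", (topic,)
    ).fetchall()
    return "\n".join(summary for (summary,) in rows)

print(reference_text("fusion"))
```

Because each summary is a tenth the size of its source article, the same database holds ten times the coverage - that's where the smaller-faster-cheaper claim comes from.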