Reference data for GPT-3 needs far less detail

I was working on what I'm calling a Knowledge System today, and I got to thinking about the fact that GPT-3 has more knowledge embedded in it than any ten humans. One of the test cases I use is to ask my cognitive architecture about the future of nuclear fusion - a problem that requires predicting the future (and is therefore a good test of intelligence). Anyways, it occurred to me that GPT-3 already knows more about nuclear fusion than everyone except the experts. So what kind of data do you need to give GPT-3 to keep it honest?

The answer is: not much. GPT-3 only needs tiny nudges of hard facts and a sprinkling of current news to grasp a topic.

Most news, blogs, and Wikipedia articles are written with the layperson in mind - someone who needs to be reminded of the basic facts of a topic, like what a tokamak is. GPT-3 needs no such reminders, and therefore it can benefit from tiny summaries of current events to keep it on track and to extract profound insights.

This leads me to believe that, in the future, there will be a need for curated datasets. Certainly, sets like The Pile are for training, but I think a much lighter set will be needed for reference.

This is where technologies such as knowledge graphs might be extremely useful for GPT-3 chatbots and even AGI. All you need is quick access to verifiable facts to keep the system honest.
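To make the fact-checking idea concrete, here is a minimal sketch of a knowledge graph as a set of (subject, relation, object) triples with a lookup step. The triples and function names here are hypothetical illustrations, not a real dataset or a real knowledge-graph API.

```python
# Toy knowledge graph: a set of (subject, relation, object) triples.
# In a real system this would live in a graph database.
facts = {
    ("ITER", "located_in", "France"),
    ("tokamak", "confines", "plasma"),
    ("deuterium", "is_a", "fusion fuel"),
}

def verify(subject, relation, obj):
    """Return True if the exact triple is present in the graph."""
    return (subject, relation, obj) in facts

def facts_about(subject):
    """Collect every stored triple about a subject, e.g. for prompt injection."""
    return [t for t in facts if t[0] == subject]
```

The point is that verification is just a fast membership check, so the model's fluent output can be spot-checked against hard facts without any heavy machinery.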

Today I had some great success automatically distilling articles down for this purpose. There's still some tweaking and fine-tuning to do, but tomorrow I plan to start testing this model to see whether it works as well on a broader set of problems. Functionally, these distilled versions of articles can be stored in a database and later used for question answering. By reducing the volume of reference text by a factor of ten, databases can be smaller, faster, and more efficient. This in turn results in better AGI systems.
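The storage side of this can be sketched in a few lines. The `distill()` placeholder below would really be an LLM summarization call; here it just keeps the first sentence so the storage and ratio bookkeeping are runnable offline. All names are hypothetical.

```python
def distill(article: str) -> str:
    # Placeholder: keep only the first sentence. In practice this would
    # be a GPT-3 summarization call.
    return article.split(". ")[0] + "."

db = {}  # stand-in for a real database

def store(doc_id: str, article: str) -> float:
    """Distill an article, store the summary, and return the compression ratio."""
    summary = distill(article)
    ratio = len(article) / max(len(summary), 1)
    db[doc_id] = {"summary": summary, "ratio": ratio}
    return ratio
```

Tracking the ratio per document is a cheap way to check whether the 10:1 target is actually being hit across a corpus.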


I think this is a neat idea. A good name for it might be Knowledge Compression.
So I can imagine using it like this:

  1. Ask GPT-3 questions on a topic to generate a list of answers.
  2. Get experts to curate and correct the answers.
  3. Have GPT-3 compress or summarize the answers.
  4. Construct a knowledge base of the compressed answers.
  5. Have GPT-3 answer future questions from the knowledge base.
  6. Have GPT-3 decompress the summaries into more complete answers if asked to.
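The steps above can be sketched as a pipeline. Each `gpt3_*` function below is a placeholder standing in for an API call, and expert curation is represented as a dict of corrections - this is a hypothetical skeleton of the loop, not a working implementation.

```python
knowledge_base = {}

def gpt3_answer(question):           # step 1: placeholder for an LLM call
    return f"draft answer to: {question}"

def curate(answer, corrections):     # step 2: apply expert corrections
    return corrections.get(answer, answer)

def gpt3_compress(answer):           # step 3: placeholder summarizer
    return answer[:40]

def build_kb(questions, corrections):    # step 4: fill the knowledge base
    for q in questions:
        a = curate(gpt3_answer(q), corrections)
        knowledge_base[q] = gpt3_compress(a)

def answer_from_kb(question):        # step 5: answer from stored summaries
    return knowledge_base.get(question)

def gpt3_decompress(summary):        # step 6: placeholder expansion
    return f"expanded: {summary}"
```

The useful property is that only steps 1, 3, and 6 touch the model at all; steps 2, 4, and 5 are ordinary database work.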

I’ve been working on this problem some more. I may have untangled it partly.

Start with very short entries (individual memories, chat logs, news bites from RSS feeds, etc.) - in other words, break your database down into very atomic entries of a few sentences each, maximum.
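A minimal sketch of that atomization step, assuming a naive regex sentence split (a real system would use a proper sentence tokenizer):

```python
import re

def atomize(text: str, max_sentences: int = 3):
    """Break text into atomic entries of at most a few sentences each."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]
```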

Then you can use an index/search tool (like Solr or Elasticsearch) to find related/relevant snippets, even if they come from very different sources - news articles, Wikipedia articles, previous conversations, PubMed papers, etc. With very short snippets, you can rapidly compile them into a reasonably sized document.
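Here is a toy stand-in for the search step - keyword-overlap ranking in place of Solr/Elasticsearch scoring - just to show the shape of "retrieve snippets, join into a document." It is a sketch under that assumption, not how those engines actually score.

```python
def search(query: str, snippets: list[str], k: int = 3) -> list[str]:
    """Rank snippets by crude keyword overlap with the query; return top k."""
    q = set(query.lower().split())
    scored = sorted(snippets,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def compile_document(query: str, snippets: list[str]) -> str:
    """Join the top hits into one working document for the model."""
    return "\n\n".join(search(query, snippets))
```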

Then, with that reasonably sized document, you can rely on GPT-3's internal understanding of the world to produce good answers to any problem. (In theory, anyway - this last part may be wishful thinking on my part.)

In the future, hopefully GPT-4 can ingest 20,000 tokens instead of 2,000, so you can give it larger chunks of information. Maybe GPT-5 can take in 2M tokens.
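Whatever the context window ends up being, the compilation step needs a budget check. Here is a sketch that packs snippets greedily until an approximate token limit is hit; the ~1.3 tokens-per-word estimate is a rough rule of thumb, not an exact tokenizer count.

```python
def fit_to_budget(snippets, max_tokens=2000):
    """Greedily keep snippets until the approximate token budget is exhausted."""
    selected, used = [], 0
    for s in snippets:
        # Rough estimate: ~1.3 tokens per whitespace-delimited word.
        cost = int(len(s.split()) * 1.3) + 1
        if used + cost > max_tokens:
            break
        selected.append(s)
        used += cost
    return selected
```

Swapping `max_tokens=2000` for 20,000 (or 2M) is the only change needed when bigger windows arrive, which is part of why atomic entries age well.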

Anyways, in the meantime, I think atomic/granular entries are the way to go.

