What kind of academic resources has GPT-3 been trained on?

I’ve been working with GPT-3 and citations, and I have to say I’m quite impressed by how rarely GPT-3 doesn’t know what I’m talking about. I would like to develop some sort of verification for GPT-3’s results, and to a large degree that means I need to know what ‘academic training’ it had. Can I get some clarity on this?

Thank you.


Now that the fine-tuning endpoint is available, you can train it on any academic source you want: PubMed, PLOS ONE, etc.

Just check that it’s in the public domain, which all government-funded research is: Policies and Disclaimers - NCBI

Can you please go further into that? I saw that the file-size limits were quite liberal, which is really cool, but that still doesn’t answer my question of where GPT-3 got its research papers from (up until 2019). So I’m more interested in what GPT-3 has already been trained on, although the new fine-tuning functionality will surely come into play.

AFAIK they scraped data from the internet. I assume they did something similar to what EleutherAI is doing: https://pile.eleuther.ai/

Is there any way to get accurate results on what GPT-3 does and doesn’t know? The hardest thing I’m facing right now is being sure whether a result is accurate or GPT-3 is just ‘filling in’ the blanks. I’m narrowing that down, but I’m having a hard time getting GPT-3 to cooperate: it hates admitting it hasn’t read a specific article.

I would not rely on GPT-3’s knowledge where science is concerned. It doesn’t “know” what it has or hasn’t read; it has no sense of pedagogy or epistemology. That’s why I recommended fine-tuning it on the academic literature that you want. GPT-3 is, first and foremost, a language model, though it may have some knowledge embedded. In my book I recommend a “trust but verify” strategy, aka “guess and check” - that is, until you fine-tune it on your empirical dataset.
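For what it’s worth, the fine-tuning endpoint takes a JSONL file of prompt/completion pairs. Here’s a minimal sketch of preparing one; the field contents and file name are placeholder assumptions, not a real dataset:

```python
import json

# Placeholder records: pair each paper's text (prompt) with whatever you
# want the model to learn to produce (completion). These are made-up stubs.
records = [
    {"prompt": "Abstract: <abstract text here>\n\nCitation:",
     "completion": " <author>, <year>, <journal>."},
    {"prompt": "Abstract: <another abstract>\n\nCitation:",
     "completion": " <author>, <year>, <journal>."},
]

def write_jsonl(records, path):
    """Write one JSON object per line - the format the fine-tuning endpoint takes."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

write_jsonl(records, "academic_finetune.jsonl")
```

Each line becomes one training example, so one record per paper (or per section) is a reasonable starting point.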


I don’t know how I’ll do that yet, but it seems like that’s my next focus.

So, playing around with the frequency penalty (raising it) resulted in much better outputs; I believe this is because it had been finding a way to explain itself off the titles. I also rearranged the prompts so that the listing hierarchy is randomized a little more. As for whether GPT-3 has any epistemic assurance about what it outputs, I don’t think it does at all, but it seems to me this is all a matter of narrowing the probability down so that it reflects the ‘knowledge’ it has.
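For anyone wanting to reproduce this, the knob in question is the `frequency_penalty` parameter on a completion request. A minimal sketch of the request shape; the prompt, engine, and exact values are my own illustrative assumptions:

```python
# Sketch of a completion request payload. Only frequency_penalty is the
# point here - everything else is an illustrative placeholder.
request = {
    "engine": "davinci",
    "prompt": "List the key findings of the paper below.\n\n"
              "Paper:\n<paper text>\n\nFindings:\n1.",
    "max_tokens": 150,
    "temperature": 0.3,
    "frequency_penalty": 1.0,  # raised from the default of 0 to discourage title-echoing
}
```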


The Answers endpoint is best, in my experience. I use some basic queries and keywords to pull documents from SOLR and then use Answers to validate.

I will give it a try!

Can you go into more detail on this? I’m looking to apply something similar with my user accounts on some journals. So far I’ve also thought of making a document parser that counts how many times each word is used, or which percentages are most represented, and using that as verification. One side effect is that the title tends to include those words, so GPT-3 could still ‘guess’ its way through; but I believe finding a way to nip the problem in the bud using GPT-3 prompts would be the most excellent approach, and then supplementing that with more robust verification.
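A minimal sketch of that word-count verification idea, assuming plain-text documents; the `overlap_score` heuristic and the word-length cutoff are my own hypothetical choices, not an established method:

```python
import re
from collections import Counter

def term_frequencies(text):
    """Lowercased word counts for a document."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def overlap_score(answer, doc_freq, title_words):
    """Fraction of the answer's content words (length > 3) that appear in
    the document body and are NOT title words. A low score suggests the
    answer is 'guessing' its way off the title alone."""
    answer_words = set(re.findall(r"[a-z']+", answer.lower()))
    content = [w for w in answer_words if len(w) > 3]
    if not content:
        return 0.0
    supported = [w for w in content
                 if doc_freq[w] > 0 and w not in title_words]
    return len(supported) / len(content)
```

You could then flag any GPT-3 answer whose score falls below some threshold for manual review, which addresses the title-guessing side effect directly.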

The biggest problems with relying exclusively on GPT-3 are cost, speed, and scalability. It really depends on how much data you have and how big each record is. I was using all of Wikipedia, so I needed something capable of handling that much data and searching it very quickly. So I started by using naive queries to pull bulk data and then used the more expensive GPT-3 for fine-grained search and QA.
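The two-stage idea above can be sketched like this; the naive filter and the pluggable expensive step are hypothetical stand-ins for SOLR and the GPT-3 call, respectively:

```python
def keyword_filter(docs, query_terms):
    """Stage 1: cheap, naive keyword filter (the role SOLR plays above).
    Keeps any document containing at least one query term."""
    terms = {t.lower() for t in query_terms}
    return [d for d in docs if terms & set(d.lower().split())]

def two_stage_search(docs, query_terms, expensive_fn, limit=5):
    """Stage 2: run the expensive step (a GPT-3 call in the post above;
    here, any callable) only on the small shortlist from stage 1."""
    shortlist = keyword_filter(docs, query_terms)[:limit]
    return [expensive_fn(doc) for doc in shortlist]
```

The design point is that the expensive, slow step only ever sees the handful of shortlisted records, so cost and latency stay flat no matter how large the corpus gets.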
