What kind of academic resources has GPT-3 been trained on?

IsaacTheBrave · July 15, 2021, 2:36pm

I’ve been working with GPT-3 and citations, and I have to say I’m quite impressed by how rarely GPT-3 doesn’t know what I’m talking about. I would like to develop some sort of verification for GPT-3’s results, and to a large degree that means I need to know what ‘academic training’ it had. Can I get some clarity on this?

Thank you.

daveshapautomator · July 15, 2021, 3:09pm

Now that the fine-tuning endpoint is available you can train it on any academic source you want! PubMed, PLOS ONE, etc.

daveshapautomator · July 15, 2021, 3:42pm

Just check that it’s public domain, which all government-funded research is: Policies and Disclaimers - NCBI

IsaacTheBrave · July 15, 2021, 3:44pm

Can you please go further into that? I saw that the file size limits were quite liberal, which is really cool, but that still doesn’t answer my question of where GPT-3 gets its research papers from (up until 2019). I’m more interested in what GPT-3 has already been trained on, although the new fine-tuning functionality will surely come into play.

daveshapautomator · July 15, 2021, 4:07pm

AFAIK they scraped data from the internet. I assume they did something similar to what EleutherAI is doing: https://pile.eleuther.ai/

IsaacTheBrave · July 15, 2021, 4:09pm

Is there any way to get accurate results on what GPT-3 knows and doesn’t know? I think the hardest thing I’m facing right now is being sure whether a result is accurate or just ‘filling in’ the blanks. I’m narrowing that down, but I’m having a hard time getting GPT-3 to cooperate. It hates admitting it didn’t read a specific article.

daveshapautomator · July 15, 2021, 4:19pm

I would not rely on GPT-3’s knowledge where science is concerned. It doesn’t “know” what it has or hasn’t read; it has no sense of pedagogy or epistemology. That’s why I recommended fine-tuning it on the academic literature that you want. GPT-3 is, first and foremost, a language model. It might have some knowledge embedded. In my book I recommend a “trust but verify” strategy, aka “guess and check” - that is, until you fine-tune it on your empirical dataset.

IsaacTheBrave · July 15, 2021, 4:24pm

I don’t know how I’ll do that yet, but it seems like that’s my next focus.

IsaacTheBrave · July 15, 2021, 4:34pm

So, playing around with the frequency penalty (setting it higher) resulted in much better outputs. I believe this is because it was finding a way to explain itself off the titles. Also, I moved the prompts around in a way that randomizes the listing hierarchy a little more. As for whether GPT-3 has any epistemic assurance of what it outputs, I don’t think it has any at all, but it seems to me like this is all a matter of somehow narrowing the probability down so that it reflects the ‘knowledge’ it has.
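For anyone following along, here is a rough sketch of what raising that setting does, assuming "frequency" here means the API's frequency penalty. As I understand it, the penalty lowers a token's score in proportion to how many times that token has already appeared in the output. The function name and the toy scores below are made up for illustration; this is not OpenAI's actual code.

```python
from collections import Counter

def apply_frequency_penalty(logits, generated_tokens, penalty):
    """Lower each token's score in proportion to how often it was already generated.

    Simplified illustration only: `logits` is a toy dict of token -> score,
    not a real model output.
    """
    counts = Counter(generated_tokens)
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}

# Toy example: "title" has already been generated twice, "study" once.
logits = {"title": 2.0, "study": 1.5, "results": 1.0}
adjusted = apply_frequency_penalty(logits, ["title", "title", "study"], penalty=0.5)
```

With a higher penalty, words that were already used (for instance, words echoed from the title) become less likely to be repeated, which may be part of why the outputs changed.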

daveshapautomator · July 15, 2021, 5:40pm

The Answers endpoint is best, in my experience. I use some basic queries and keywords to pull documents from SOLR and then use Answers to validate.

IsaacTheBrave · July 15, 2021, 5:59pm

I will give it a try!

Can you go into more detail on this? I’m looking to apply something similar with my user accounts at some journals. So far, I’ve also thought of making a document parser that counts how many times a word is used, or which words are most represented by percentage, and using that as verification. One side effect is that the title tends to include those words, so GPT-3 would still ‘guess’ its way through. I believe the best approach would be to nip the problem in the bud with GPT-3 prompts and then supplement that with more robust verification.
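A minimal sketch of the word-counting parser idea, assuming the article body is already available as plain text. The function names (`term_shares`, `support_score`) are hypothetical, and the score is only a crude signal: it says whether the terms in a GPT-3 answer appear in the document at all, not whether the answer is correct.

```python
import re
from collections import Counter

def term_shares(text):
    """Return each word's share of all words in the text (0.0 to 1.0)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

def support_score(claim_terms, document_text):
    """Fraction of the claimed terms that occur at least once in the document."""
    if not claim_terms:
        return 0.0
    shares = term_shares(document_text)
    found = [term for term in claim_terms if term.lower() in shares]
    return len(found) / len(claim_terms)

# Usage: check terms from a model answer against the article body only.
body = "Solr indexes documents. Solr search is fast."
score = support_score(["solr", "gpt"], body)
```

Passing in the body without the title is one way to reduce the title side effect, since a guess built from title words alone would then score lower.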

daveshapautomator · July 15, 2021, 6:27pm

The biggest problems with relying exclusively on GPT-3 are cost, speed, and scalability. It really depends on how much data you have and how big each record is. I was using all of Wikipedia so I needed something capable of handling that much data and searching it very quickly. So I started by using naive queries to pull bulk data and then using the more expensive GPT-3 for fine-grained search and QA.
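The two-stage idea can be sketched roughly like this. It is a toy version under stated assumptions: `keyword_filter` stands in for the cheap bulk query (Solr in my case), and `expensive_scorer` is a hypothetical callable standing in for the GPT-3 step. Neither is a real Solr or OpenAI call.

```python
def keyword_filter(query, documents, limit=3):
    """Cheap first pass: keep the documents sharing the most words with the query."""
    terms = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(terms & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]

def two_stage_search(query, documents, expensive_scorer, limit=3):
    """Run the costly scorer only on the few candidates from the cheap pass."""
    candidates = keyword_filter(query, documents, limit)
    return max(candidates, key=lambda doc: expensive_scorer(query, doc), default=None)

# Usage with a dummy scorer (hypothetical stand-in for a model call).
calls = []
def dummy_scorer(query, doc):
    calls.append(doc)
    return len(doc)

docs = [
    "solr indexes documents",
    "gpt-3 answers questions about documents",
    "cats sleep",
]
best = two_stage_search("documents about solr", docs, dummy_scorer)
```

The point of the design is that the expensive step runs once per shortlisted candidate instead of once per record, so cost scales with `limit` and not with the size of the corpus.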
