Has GPT been trained with pay-walled scientific papers or copyrighted books, journals, etc?

For example Springer, Addison-Wesley, IEEE, ACM, etc. papers.

Also, what about old magazines, books, and journals on the Web Archive (e.g. the "Amiga Books" collection on the Internet Archive)?



From everything I’ve read, I do believe at least GPT-3 was trained on paywalled content accessed through sites that circumvent the paywalls. I can’t confirm whether GPT-4 was, but I’d guess it was. I think this may end up being a big sticking point for future model training. It may push toward training models on less data to avoid needing the paywalled material.

AI Explained has a great new video on the data used and how this may change in the future.
Here is the video: What's Behind the ChatGPT History Change? How You Can Benefit + The 6 New Developments This Week - YouTube
