Has GPT been trained with pay-walled scientific papers or copyrighted books, journals, etc?

For example Springer, Addison-Wesley, IEEE, ACM, etc. papers.

Also, what about old magazines, books, and journals on the Web Archive (e.g. the "Amiga Books" collection on the Internet Archive)?



From everything I’ve read, I do believe at least GPT-3 was trained on paywalled content accessed through sites that circumvent the paywalls. I can’t confirm whether GPT-4 was, but I’d guess it was. I think this may end up being a big sticking point for future model training. It may push toward training models on less data to avoid needing the paywalled material.

AI Explained has a great new video on the data used and how this may change in the future.
Here is the video: What's Behind the ChatGPT History Change? How You Can Benefit + The 6 New Developments This Week - YouTube
