The GPT-3 paper says the models were trained on filtered Common Crawl, WebText2, Books1, Books2, and Wikipedia.
Is either of the books datasets (Books1 or Books2) Project Gutenberg? If not, is there any public information about what they contain?
Thanks.