Training data - books1 & books2

Hi there everyone,

Just wondering if someone can enlighten me or point me in the right direction about the training datasets cited in the original OpenAI GPT-3 paper, “Books1” and “Books2”. Are these public repositories that we can look at? If not, are there descriptions of their contents, for instance what kinds of books are included and in what languages?

Thanks for your advice!


Thanks, @m-a.schenk; that is what it seemed like to me, but I wanted to check I wasn’t overlooking something obvious. I am leading a team of seven PhD researchers, from a diverse range of countries and languages, testing the values of GPT-3. We have started to notice interesting differences in outputs. We were hoping to better understand the nature of Books1 and Books2 so we can more clearly interpret the results we are seeing.


Ah yes, I did, but a few weeks ago. Good call; I have already drafted a letter to them, so I will follow up on that. I just thought someone here might know. I wasn’t sure if it was open knowledge that I simply hadn’t located yet.