Training data - books1 & books2

Hi there everyone,

Just wondering if someone can enlighten me or point me in the right direction about the training datasets cited in the original OpenAI GPT-3 paper, “Books1” and “Books2”. Are these public repositories that we can look at? If not, are there descriptions of their contents, for instance what kinds of books are included and in what languages?

Thanks for your advice!


Thanks, @m-a.schenk; that is what it seemed like to me, but I wanted to check I wasn’t overlooking something obvious. I am leading a team of seven PhD researchers, from a diverse range of countries and languages, testing the values of GPT-3. We have started to notice interesting differences in outputs. We were hoping to better understand the nature of Books1 and Books2 so we can more clearly interpret the results we are seeing.


Ah yes, I did, but a few weeks ago. Good call; I have already drafted a letter to them, so I will follow up on that. I just thought someone here might know. I wasn’t sure if it was open knowledge that I simply hadn’t located yet.