The Pile Dataset (800GB) for the next GPT

I’m not sure how many researchers or people who work directly with GPT-3 are here, but given the mediocre quality of the datasets GPT-3 was trained on, and the fact that they date back to 2019, I’d suggest looking into the Pile dataset recently published by the EleutherAI team.

The Pile is an 800GB, well-curated dataset; see the composition figure in the paper.

The full paper is here: [2101.00027] The Pile: An 800GB Dataset of Diverse Text for Language Modeling

I have no affiliation with EleutherAI, but I’m amazed by the work they’ve put into making this dataset available. And it includes data from as recently as the second half of 2020.

They used the Pile to train their GPT-Neo transformers, but to be honest, at 2.7B parameters the models don’t fare too well compared to GPT-3.

And I can only imagine what would happen if you combined GPT-3 with the Pile (a 2021 version of it).

If someone from the core team wants to discuss this further, I have some ideas that could be worth exploring.


I see that NIH and ArXiv are both on there already… now just to add PLOS ONE!


I just hope they take it into consideration.

There is no detailed description. OA has been asked repeatedly, but has never disclosed even what some of the datasets (like the books datasets) are.


Well, for what it’s worth… I hope they’ve used an unofficial version of the books dataset lol