The Pile Dataset (800GB) for the next GPT

I’m not sure how many researchers or people who work directly with GPT-3 are here, but given the mediocre quality of the datasets GPT-3 was trained on, and the fact that they date back to 2019, I’d suggest looking into the Pile dataset recently published by the EleutherAI team.

The Pile is an 800GB, well-curated dataset; see the composition figure in the paper.

The full paper is here: [2101.00027] The Pile: An 800GB Dataset of Diverse Text for Language Modeling

I have no affiliation with EleutherAI, but I’m amazed by the work they’ve put into making this dataset available. And it includes data from as recently as the second half of 2020.

They used the Pile to train their GPT-Neo transformers, but to be honest, at 2.7B parameters the models don’t fare too well compared to GPT-3.

And I can only imagine what would happen if you combined GPT-3 with the Pile (a 2021 version of it).

If someone from the core team wants to discuss this further, I have some ideas that could be worth exploring.


I see that NIH and ArXiv are both on there already… now just to add PLOS ONE!


I just hope they take it into consideration.

There is no detailed description. OA has been asked repeatedly, but has never disclosed even what some of the datasets (like the books datasets) are.


Well, for what it’s worth… I hope they’ve used an unofficial version of the books dataset lol