I’m having difficulty finding the size of the data used to train GPT-3. Searches return wildly divergent answers, anywhere from 570GB to 45TB. The paper “Language Models are Few-Shot Learners” would seem to be the definitive source. The largest component of the training data was CommonCrawl, which “. . . was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens.” That explains the divergent answers I’m seeing. But CommonCrawl was not the only data source, and a corresponding table indicates that it was only 60% of the data used. That gives me 570GB / 0.6 = 950GB for the training set. Am I missing anything?
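Here is the arithmetic I’m doing as a quick sanity check, assuming the 60% in the table is a share of the filtered data size (if it’s actually a sampling weight during training, my estimate could be off):

```python
# Back-of-envelope estimate of the total filtered training data,
# using only the numbers quoted above: 570GB of filtered CommonCrawl,
# assumed to make up 60% of the total.
filtered_commoncrawl_gb = 570
commoncrawl_share = 0.60  # assumption: treating the 60% as a share of the data size

estimated_total_gb = filtered_commoncrawl_gb / commoncrawl_share
print(f"Estimated total filtered training data: {estimated_total_gb:.0f} GB")
# -> Estimated total filtered training data: 950 GB
```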
I ask because the GPT-3 model is reportedly around 800GB. Am I crazy for thinking it is bizarre that a statistical model would be only fractionally smaller than the data it was trained on?
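For what it’s worth, here is how I’d sanity-check that model-size figure myself. The 175 billion parameter count is from the paper; the bytes-per-parameter values are my own assumptions about storage precision, so treat this as a rough sketch rather than a statement of how OpenAI actually stores the model:

```python
# Rough estimate of GPT-3's on-disk size from its parameter count.
# 175 billion parameters is from the paper; the precisions below are assumptions.
n_params = 175e9

for label, bytes_per_param in [("fp16", 2), ("fp32", 4)]:
    size_gb = n_params * bytes_per_param / 1e9
    print(f"{label}: ~{size_gb:.0f} GB")
# fp16: ~350 GB
# fp32: ~700 GB
```

So the 800GB figure I keep seeing is at least in the same ballpark as a full-precision checkpoint, which is why the comparison to the ~950GB of training data strikes me as odd.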
I’m not any kind of expert on large language models or deep learning, so I’m happy to be enlightened.