# What is the size of the training set for GPT-3?

I’m having difficulty finding the size of the data used to train GPT-3. Searches return wildly divergent answers, anywhere from 570GB to 45TB. The paper “Language Models are Few-Shot Learners” would seem to be the definitive source. The largest training set was CommonCrawl, which “. . . was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens.” That explains the divergent answers I’m seeing. But CommonCrawl was not the only data source, and a corresponding table indicates that it was only 60% of the data used. That gives me 570/0.6 ≈ 950GB for the training set. Am I missing anything?

I ask because the GPT-3 model is 800GB. Am I crazy to think it is bizarre that a statistical model is only fractionally smaller than the data it was trained on?
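For scale, here is a quick back-of-envelope check on that 800GB figure (my own arithmetic, not an official number): with 175B parameters, the raw weights alone land in that ballpark, depending on what numeric precision you assume. The byte widths below are generic float sizes, not anything OpenAI has published about their checkpoint format.

```python
# Back-of-envelope: raw size of GPT-3's weights at different precisions.
# 175B parameters is from the paper; the precisions are assumptions.
n_params = 175e9

for name, bytes_per_param in [("float32", 4), ("float16", 2)]:
    gb = n_params * bytes_per_param / 1e9
    print(f"{name}: {gb:.0f} GB")  # 700 GB at float32, 350 GB at float16
```

So at 32-bit precision the parameters alone are ~700GB, which makes an ~800GB checkpoint (with whatever metadata it carries) plausible.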

I’m not any kind of expert on large language models or deep learning, so I’m happy to be enlightened.

Unfortunately, it’s not that straightforward.

All GPT-3 models were trained for a total of 300B tokens. Some tokens would have been sampled many times, others only once.

We can estimate the number of unique tokens used to train GPT-3 by multiplying the number of tokens in each dataset by the number of epochs that dataset sees over the 300B-token run, capping the epoch count at 1 so repeated passes aren’t double-counted, and summing the results:

\sum_\text{Datasets} n \times \min{(n_{epochs @ 300B}, 1)}

And we get a value of 238B unique training tokens.

From that you can compute the total size of the training data.
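As a sketch, here is that sum worked through with the per-dataset token and epoch figures from Table 2.2 of the paper. The GB conversion at the end assumes the paper’s 570GB ≈ 400B-token ratio for filtered CommonCrawl holds roughly for the other datasets too, which is only an approximation.

```python
# Per-dataset figures from Table 2.2 of "Language Models are Few-Shot Learners":
# (tokens in billions, epochs elapsed when training for 300B tokens)
datasets = {
    "Common Crawl (filtered)": (410, 0.44),
    "WebText2":                (19,  2.90),
    "Books1":                  (12,  1.90),
    "Books2":                  (55,  0.43),
    "Wikipedia":               (3,   3.40),
}

# A dataset seen for more than 1 epoch contributes all its tokens once;
# one seen for less than 1 epoch contributes only the fraction sampled.
unique_b = sum(n * min(epochs, 1) for n, epochs in datasets.values())
print(f"~{unique_b:.0f}B unique training tokens")  # ~238B

# Rough size, assuming ~570GB of text per ~400B tokens across the board:
gb = unique_b * (570 / 400)
print(f"~{gb:.0f} GB of training text")
```

That lands at roughly 238B unique tokens, or on the order of 340GB of text under the assumed bytes-per-token ratio.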


I guess the way I look at it is that the LLM is essentially a curve fit.

So many points go into the training, and what you are left with are the smoothing coefficients, which occupy far fewer bits.
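As a toy illustration of that compression (entirely hypothetical data, nothing from the paper): a thousand noisy points on a line boil down to two fitted coefficients.

```python
import random

# 1,000 noisy points go in; only two numbers (slope, intercept) come out.
random.seed(0)
xs = [i / 100 for i in range(1000)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

# Ordinary least squares for y = a*x + b, done by hand with the stdlib.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(f"1000 points -> 2 coefficients: a={a:.2f}, b={b:.2f}")
```

The fit recovers roughly a ≈ 2 and b ≈ 1: the "model" is vastly smaller than the data, yet captures the trend.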

I would say that if the LLM is bigger than, or the same size as, the training data, the LLM is actually under-trained, because of this phenomenon.

I suspect the smaller GPT-3 models were trained on less data, with omissions so as not to overfit. We can test what they know.

We’ll try to make it complete the most obvious line of a song:

The output skips to the next line, with the correct word ` in` at only 2.32%.

Oops, I meant to use the base model ada for completions; here’s that:

Results without poem tuning are just uninformed repetition, with ` in` at 10%.

So we iterate until we find knowledge. babbage gives ` in` as likely as a comma, but a newline wins by far (then it repeats the input):

I tried to make it even more obvious for curie (“just give me the last two words of a line”), but it is still wrong:

Finally, davinci can do it, with its 25x jump in model parameters.
We can see that having the knowledge gives the AI a massive jump to 97.7%:

Interestingly, the new babbage-002 gives hints that it has been trained on a large dataset, yet what we actually get is extreme perplexity and quick degradation, making it often worse than ada for general tasks:

(Did they just take their davinci-002 and give it 1.5 bit resolution?)


I don’t think this shows what you think it shows.

We know, from the paper, that the entire possible training set is about 500B tokens for the GPT-3 models.

We know, from the paper, that all models were trained on a total of 300B tokens.

There’s zero evidence that smaller models were trained on less data.

In fact, you can just read the Details of Model Training appendix of the paper to see that this is not the case.

I should have offered a clearer example: can GPT-3 models complete a very unlikely word, “Jupiter”, from a lyric?

davinci

curie

How does GPT-3 175B go from knowing to GPT-3 6.7B dramatically not knowing, if not by training only on a subset of the corpus?

It is possible to go from knowing to not knowing with fewer coefficients.

The reason is that the same volume of information simply cannot be captured by the much smaller number of coefficients (weights) in the model.

So if I took the picture I posted above and superimposed a high-frequency sine wave on top of it, the higher-capacity model could potentially follow this detail, whereas the lower-capacity model would just stick to the general trend.

I could train two models, one tiny, and one big, on the same big data, and the tiny model would “forget” and not understand, because it lacks the capacity to understand.
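A toy version of that capacity argument, with made-up data rather than anything from GPT-3: the same trend-plus-sine curve, fit once by a 2-coefficient model and once by a 16-coefficient model.

```python
import numpy as np

# Same data for both models: a linear trend plus superimposed sine "detail".
x = np.linspace(-1, 1, 200)
y = x + 0.2 * np.sin(10 * x)

# Tiny model: 2 coefficients. Big model: 16 coefficients.
small = np.polyval(np.polyfit(x, y, 1), x)
big = np.polyval(np.polyfit(x, y, 15), x)

err_small = np.sqrt(np.mean((y - small) ** 2))
err_big = np.sqrt(np.mean((y - big) ** 2))
print(f"small-model error: {err_small:.3f}, big-model error: {err_big:.3f}")
```

The low-capacity fit tracks only the general trend and leaves the sine wave as residual error, while the high-capacity fit follows the detail — the same data, "forgotten" by the smaller model for lack of coefficients.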

But I find your observation of davinci-002 repeating interesting; I have observed the same problems.


I suppose it is reasonable that language-model training cannot serve as a lossless compression method to turn 40TB into 40GB. That there would be no weighting or training at all seems unexpected, though.

It would be interesting to get back all the logits for a position, even if only as a research feature available for a single token.

Here’s an interesting paper where OpenAI digs into the actual neurons of knowledge of GPT-2, extracting activating text from them.
