# What is the size of the training set for GPT-3?

I’m having difficulty finding the size of the data used to train GPT-3. Searches return wildly divergent answers, anywhere from 570GB to 45TB. The paper “Language Models are Few-Shot Learners” would seem to be the definitive source. The largest training set was CommonCrawl, which “. . . was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens.” That explains the divergent answers I’m seeing. But CommonCrawl was not the only data source, and a corresponding table indicates that it was only 60% of the data used. That gives me 570/0.6 ≈ 950GB for the training set. Am I missing anything?

I ask because the GPT-3 model is 800GB. Am I crazy to think it is bizarre that a statistical model is only fractionally smaller than the data it was trained on?
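For scale, here is a quick back-of-envelope check on that 800GB figure (my own arithmetic, not an official number): with 175B parameters, the raw weights alone land in that ballpark, depending on what numeric precision you assume. The byte widths below are generic float sizes, not anything OpenAI has published about their checkpoint format.

```python
# Back-of-envelope: raw size of GPT-3's weights at different precisions.
# 175B parameters is from the paper; the precisions are assumptions.
n_params = 175e9

for name, bytes_per_param in [("float32", 4), ("float16", 2)]:
    gb = n_params * bytes_per_param / 1e9
    print(f"{name}: {gb:.0f} GB")  # 700 GB at float32, 350 GB at float16
```

So at 32-bit precision the parameters alone are ~700GB, which makes an ~800GB checkpoint (with whatever metadata it carries) plausible.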

I’m not any kind of expert on large language models or deep learning, so I’m happy to be enlightened.

Unfortunately, it’s not that straightforward.

All GPT-3 models were trained for a total of 300B tokens. Some tokens would have been sampled many times, others only once.

We can estimate the number of unique tokens used to train GPT-3 by multiplying the number of tokens in each dataset by the number of epochs that dataset sees over the 300B-token run, capping the epoch count at 1 so repeated passes aren’t double-counted, and summing the results:

\sum_\text{Datasets} n \times \min{(n_{epochs @ 300B}, 1)}

And we get a value of 238B unique training tokens.

From that you can compute the total size of the training data.
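As a sketch, here is that sum worked through with the per-dataset token and epoch figures from Table 2.2 of the paper. The GB conversion at the end assumes the paper’s 570GB ≈ 400B-token ratio for filtered CommonCrawl holds roughly for the other datasets too, which is only an approximation.

```python
# Per-dataset figures from Table 2.2 of "Language Models are Few-Shot Learners":
# (tokens in billions, epochs elapsed when training for 300B tokens)
datasets = {
    "Common Crawl (filtered)": (410, 0.44),
    "WebText2":                (19,  2.90),
    "Books1":                  (12,  1.90),
    "Books2":                  (55,  0.43),
    "Wikipedia":               (3,   3.40),
}

# A dataset seen for more than 1 epoch contributes all its tokens once;
# one seen for less than 1 epoch contributes only the fraction sampled.
unique_b = sum(n * min(epochs, 1) for n, epochs in datasets.values())
print(f"~{unique_b:.0f}B unique training tokens")  # ~238B

# Rough size, assuming ~570GB of text per ~400B tokens across the board:
gb = unique_b * (570 / 400)
print(f"~{gb:.0f} GB of training text")
```

That lands at roughly 238B unique tokens, or on the order of 340GB of text under the assumed bytes-per-token ratio.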


I guess the way I look at it is that the LLM is essentially a curve fit.

So many points go into the training, and what you are left with are the smoothing coefficients, which occupy far fewer bits.
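As a toy illustration of that compression (entirely hypothetical data, nothing from the paper): a thousand noisy points on a line boil down to two fitted coefficients.

```python
import random

# 1,000 noisy points go in; only two numbers (slope, intercept) come out.
random.seed(0)
xs = [i / 100 for i in range(1000)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

# Ordinary least squares for y = a*x + b, done by hand with the stdlib.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(f"1000 points -> 2 coefficients: a={a:.2f}, b={b:.2f}")
```

The fit recovers roughly a ≈ 2 and b ≈ 1: the "model" is vastly smaller than the data, yet captures the trend.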

I would say that if the LLM is bigger than, or the same size as, the training data, the LLM is actually under-trained, because of this phenomenon.

I suspect the smaller GPT-3 models were trained on less data, with omissions so as not to overfit. We can test what they know.

We’ll try to make it complete the most obvious line of a song:

The output skips to the next line, with the correct word ` in` at only 2.32%.

Oops, I meant to use the base model ada for completions; here’s that:

Results without poem tuning are just uninformed repetition, with ` in` at 10%.

So we iterate until we find knowledge. babbage gives ` in` as likely as a comma, but a newline wins by far (then it repeats the input):

I tried to make it even more obvious for curie (“just give me the last two words of a line”), but it is still wrong:

Finally, davinci can do it, with its 25x jump in model parameters.
We can see that having the knowledge gives the AI a massive jump to 97.7%:

Interestingly, the new babbage-002 gives hints that it has been trained on a large dataset, yet what we actually get is extreme perplexity and quick degradation, making it often worse than ada for general tasks:

(Did they just take their davinci-002 and give it 1.5 bit resolution?)


I don’t think this shows what you think it shows.

We know, from the paper, that the entire possible training set is about 500B tokens for the GPT-3 models.

We know, from the paper, that all models were trained on a total of 300B tokens.

There’s zero evidence that smaller models were trained on less data.

In fact, you can just read the Details of Model Training appendix of the paper to see that this is not the case.

I should have offered a clearer example: can GPT-3 models complete a very unlikely word, “Jupiter”, from a lyric?

davinci

curie

How does GPT-3 175B go from knowing to GPT-3 6.7B dramatically not knowing, if not by training only on a subset of the corpus?

It is possible to go from knowing to not knowing with fewer coefficients.

The reason is that the same volume of information simply cannot be captured by the much smaller number of coefficients (weights) in the model.

So if I took the picture I posted above and superimposed a high-frequency sine wave on top of it, the higher-capacity model could potentially follow this detail, whereas the lower-capacity model would just stick to the general trend.

I could train two models, one tiny, and one big, on the same big data, and the tiny model would “forget” and not understand, because it lacks the capacity to understand.
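A toy version of that capacity argument, with made-up data rather than anything from GPT-3: the same trend-plus-sine curve, fit once by a 2-coefficient model and once by a 16-coefficient model.

```python
import numpy as np

# Same data for both models: a linear trend plus superimposed sine "detail".
x = np.linspace(-1, 1, 200)
y = x + 0.2 * np.sin(10 * x)

# Tiny model: 2 coefficients. Big model: 16 coefficients.
small = np.polyval(np.polyfit(x, y, 1), x)
big = np.polyval(np.polyfit(x, y, 15), x)

err_small = np.sqrt(np.mean((y - small) ** 2))
err_big = np.sqrt(np.mean((y - big) ** 2))
print(f"small-model error: {err_small:.3f}, big-model error: {err_big:.3f}")
```

The low-capacity fit tracks only the general trend and leaves the sine wave as residual error, while the high-capacity fit follows the detail — the same data, "forgotten" by the smaller model for lack of coefficients.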

But I find your observation of davinci-002 repeating interesting; I have observed the same problems.


I suppose it is reasonable that language-model training cannot serve as a lossless compression method to turn 40TB into 40GB. That there would be no weighting or training at all seems unexpected, though.

It would be interesting to get back all the logits for a position, even if only as a research feature available for a single token.

Here’s an interesting paper where OpenAI digs into the actual neurons of knowledge of GPT-2, extracting activating text from them.
