What is the size of the training set for GPT-3?

I can show that the smaller GPT-3 models were trained on less data, with material omitted so they don’t overfit. We can test what they know.

ada-001 (instruction-following tune):
We’ll try to make it complete the most obvious line of a song:

[screenshot: ada-001 completion with token probabilities]

The output skips to the next line, with the correct word, “in”, showing up at only 2.32%.
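
If you want to poke at this outside the Playground, here’s a minimal sketch using the pre-1.0 `openai` Python library (the prompt below is a placeholder, not the lyric in my screenshots; the `logprobs` field returns the same top-5 token percentages the Playground shows):

```python
import math
import openai

openai.api_key = "sk-..."           # your API key here

def top_token_probs(model, prompt, n=5):
    """Ask for a single next token and return the top-n candidates as percentages."""
    resp = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=1,               # we only care about the very next token
        temperature=0,
        logprobs=n,                 # top-n log-probabilities per position (API max is 5)
    )
    top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    # convert log-probabilities into the percentages the Playground displays
    return sorted(((tok, 100 * math.exp(lp)) for tok, lp in top.items()),
                  key=lambda pair: -pair[1])

# placeholder prompt -- substitute the actual song line, minus its last word
prompt = "(your song line, minus its last word)"
for token, pct in top_token_probs("ada", prompt):
    print(f"{token!r}: {pct:.2f}%")
```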


Oops, I meant to use the base model ada for completions; here’s that:

[screenshot: ada completion with token probabilities]

The base model, with no tuning, just gives uninformed repetition, with “in” at 10%.


So we iterate up the model sizes until we find the knowledge. babbage rates “in” as likely as a comma, but a newline wins by far (followed by a repetition of the input):

[screenshot: babbage completion with token probabilities]
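
You can also just look at the raw completion text to watch the newline-then-echo behavior, reusing the setup above:

```python
# Let babbage run a bit longer and inspect the raw text it produces.
resp = openai.Completion.create(
    model="babbage",
    prompt=prompt,                  # same placeholder prompt as above
    max_tokens=40,
    temperature=0,
)
# repr() makes the leading '\n' visible before the echoed input
print(repr(resp["choices"][0]["text"]))
```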


I tried to make it even more obvious for curie (“just give me the last two words of a line”), but it’s still wrong:

[screenshot: curie completion with token probabilities]


Finally, davinci can do it, with its roughly 25x jump in parameters over curie.
We can see that having the knowledge gives the AI a massive jump, to 97.7%:

[screenshot: davinci completion with token probabilities]
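
For one view of the whole ladder, the helper above can be looped over the four base models (a sketch; the target token carries a leading space, and a word outside the top 5 simply reads as 0 here):

```python
# Compare how much probability each base model puts on the expected word.
target = " in"                      # completion tokens keep their leading space

for model in ["ada", "babbage", "curie", "davinci"]:
    probs = dict(top_token_probs(model, prompt))
    print(f"{model:>8}: {probs.get(target, 0.0):5.2f}% on {target!r}")
```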


Interestingly, the new babbage-002 gives hints that it was trained on more data, but we also see the model’s extreme perplexity and quick degradation, which often makes it worse than ada for general tasks:

[screenshot: babbage-002 completion with token probabilities]
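
One way to put a rough number on that degradation is to let the model run longer and compute perplexity over the tokens it picks itself, i.e. exp of the mean negative log-likelihood (same library, helper setup, and placeholder prompt as above):

```python
def completion_perplexity(model, prompt, max_tokens=64):
    """Generate greedily and compute perplexity over the model's own chosen tokens."""
    resp = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0,
        logprobs=1,                 # we only need the chosen token's logprob
    )
    lps = [lp for lp in resp["choices"][0]["logprobs"]["token_logprobs"] if lp is not None]
    return math.exp(-sum(lps) / len(lps))   # perplexity = exp(mean NLL)

for model in ["ada", "babbage-002"]:
    print(f"{model:>12}: perplexity ≈ {completion_perplexity(model, prompt):.1f}")
```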

(Did they just take their davinci-002 and quantize it down to 1.5-bit resolution?)
