I can show that the smaller GPT-3 models were trained on less data, with content omitted so they would not overfit. Let’s test what they know.
ada-001 (instruction-following tune):
We’ll try to make it complete the most obvious line of a song:
The output skips to the next line, with the correct word, “in”, at only 2.32%.
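(Side note for reproducing these numbers: the percentages look like the playground’s token probabilities, and you can pull the same thing from the API’s logprobs field. Here’s a minimal sketch; the prompt and target word are stand-ins since the actual song isn’t reproduced here, and babbage-002 stands in for the now-retired ada/ada-001/babbage/curie/davinci series.)

```python
# A minimal sketch of the probability check, not the exact prompt used above.
# PROMPT and TARGET are stand-ins, and babbage-002 stands in for the retired models.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Twinkle, twinkle, little star,\nHow I wonder what you"
TARGET = " are"  # the word the model should complete with ("in" in the tests above)

resp = client.completions.create(
    model="babbage-002",
    prompt=PROMPT,
    max_tokens=1,
    temperature=0,
    logprobs=5,  # also return the 5 most likely tokens at this position
)

top = resp.choices[0].logprobs.top_logprobs[0]  # dict: token -> logprob
for token, lp in sorted(top.items(), key=lambda kv: -kv[1]):
    print(f"{token!r:>8}  {100 * math.exp(lp):6.2f}%")  # roughly the % the playground shows
if TARGET in top:
    print(f"correct word at {100 * math.exp(top[TARGET]):.2f}%")
else:
    print("correct word not in the top 5")
```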
Oops, I meant to use the base model ada for completions; here’s that:
Without the tuning, the result is just uninformed repetition of the poem, with “in” at 10%.
So we iterate up the model sizes until we find the knowledge. babbage gives “in” about as likely as a comma, but a newline wins by far (and then it repeats the input):
I tried to make it even more obvious for curie, “just give me the last two words of a line”, but it still gets it wrong:
Finally davinci can do it, with its 25x jump in model parameters.
We can see that having the knowledge gives the model a massive jump, to 97.7%:
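(The same check, looped over model sizes, reproduces this jump in one script. Again a sketch with placeholder models, prompt, and target, since the original base series is retired:)

```python
# Sketch: sweep the check over several completion models and report the
# probability of the target word, i.e. the 2.32% / 10% / 97.7%-style numbers.
# Model names, PROMPT, and TARGET are placeholders, as before.
import math
from openai import OpenAI

client = OpenAI()
PROMPT = "Twinkle, twinkle, little star,\nHow I wonder what you"
TARGET = " are"

for model in ("babbage-002", "davinci-002"):  # stand-ins for ada..davinci
    resp = client.completions.create(
        model=model, prompt=PROMPT, max_tokens=1, temperature=0, logprobs=5
    )
    top = resp.choices[0].logprobs.top_logprobs[0]
    pct = 100 * math.exp(top[TARGET]) if TARGET in top else 0.0
    best = max(top, key=top.get)
    print(f"{model:12}  target at {pct:5.2f}%   top token: {best!r}")
```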
Interestingly, the new babbage-002 gives hints that it has been trained on a lot of data, but instead of a clear improvement we get extreme perplexity and quick degradation, often making it worse than ada for general tasks:
(Did they just take their davinci-002 and give it 1.5-bit resolution?)
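(To put a number on that “extreme perplexity” instead of eyeballing completions, the completions endpoint can score a fixed passage: echo the prompt with logprobs and exponentiate the mean negative log-probability. A sketch below; it assumes echo=True with max_tokens=0 still works for babbage-002 the way it did for the legacy base models.)

```python
# Sketch: rough perplexity of a base model on a fixed reference passage.
# Assumes the completions endpoint still honours echo=True with max_tokens=0
# for babbage-002, as it did for the legacy base models; if not, this will error.
import math
from openai import OpenAI

client = OpenAI()
PASSAGE = "The quick brown fox jumps over the lazy dog."  # any reference text

resp = client.completions.create(
    model="babbage-002",
    prompt=PASSAGE,
    max_tokens=0,   # generate nothing, just score the prompt
    echo=True,      # return the prompt tokens and their logprobs
    logprobs=1,
)

lps = resp.choices[0].logprobs.token_logprobs[1:]  # first token has no logprob
ppl = math.exp(-sum(lps) / len(lps))
print(f"perplexity on the passage: {ppl:.1f}")
```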