What is the size of the training set for GPT-3?

I guess the way I look at it is that an LLM is essentially a curve fit.

So many data points go into the training, and what you are left with are the smoothing coefficients, which take up far fewer bits.
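A toy sketch of that idea (the function and the numbers here are made up, nothing GPT-specific): fit thousands of noisy points with a low-order polynomial and compare how many bytes each side takes.

    import numpy as np

    # 10,000 noisy samples of a smooth curve (hypothetical toy data).
    x = np.linspace(0, np.pi, 10_000)
    y = np.sin(x) + 0.1 * np.random.randn(x.size)

    # A cubic fit: 4 coefficients stand in for 10,000 points.
    coeffs = np.polyfit(x, y, deg=3)

    print(f"data:         {y.nbytes} bytes")      # 80,000 bytes (float64)
    print(f"coefficients: {coeffs.nbytes} bytes")  # 32 bytes

The fit can reproduce the overall shape of the data but not the individual points, which is the lossy-compression flavor of the analogy.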

I would say that if the LLM is bigger than, or the same size as, its training data, the LLM is actually under-trained, because of this phenomenon.
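For scale, here is a rough back-of-the-envelope using GPT-3's published numbers (175B parameters trained on roughly 300B tokens, per Brown et al., 2020); the bytes-per-token figure below is my own rough assumption:

    # GPT-3 (Brown et al., 2020): 175B parameters, ~300B training tokens.
    params = 175e9
    tokens = 300e9

    weight_gb = params * 2 / 1e9  # fp16: 2 bytes per parameter -> ~350 GB
    text_gb   = tokens * 4 / 1e9  # assume ~4 bytes of text per token -> ~1,200 GB

    print(f"weights:       ~{weight_gb:,.0f} GB")
    print(f"training text: ~{text_gb:,.0f} GB")
    print(f"tokens/param:  ~{tokens / params:.1f}")

At roughly 1.7 tokens per parameter, GPT-3 sits well below the ~20 tokens per parameter that the later Chinchilla work (Hoffmann et al., 2022) found compute-optimal, which is consistent with calling it under-trained.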