I guess the way I look at it is that the LLM is essentially a curve fit.
So many data points go into the training, and what you are left with are the smoothing coefficients, which occupy far fewer bits.
I would say that if the LLM is bigger than, or the same size as, its training data, the LLM is actually under-trained, because of this phenomenon.
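To make the analogy concrete, here's a toy sketch (plain numpy, all numbers made up for illustration): fit a cubic to thousands of noisy points and compare what the "model" takes to store versus the raw data.

```python
# Toy version of the curve-fit analogy: many training points in,
# a handful of coefficients out.
import numpy as np

rng = np.random.default_rng(0)

# 10,000 noisy "training" samples of an underlying cubic.
x = np.linspace(-1, 1, 10_000)
y = 2 * x**3 - x + rng.normal(scale=0.1, size=x.size)

# Fit a degree-3 polynomial: the "model" is just 4 coefficients.
coeffs = np.polyfit(x, y, deg=3)

data_bytes = y.nbytes        # raw training targets
model_bytes = coeffs.nbytes  # fitted "smoothing coefficients"
print(f"data: {data_bytes} bytes, model: {model_bytes} bytes")
# data: 80000 bytes, model: 32 bytes -- the fit occupies far fewer
# bits, at the cost of only reproducing the data approximately.
```

If instead your "model" had as many free parameters as data points, it could just memorize everything, which is the sense in which a model as big as its training data hasn't been forced to compress (i.e., is under-trained relative to its capacity).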