I have seen this picture regarding GPT-4 (pic related), but I can’t seem to find anything that confirms it. Is it real?

If it is real, what do you think about it?

1 Like

I have heard the same rumors, but I highly doubt it would get that big. The training and operational costs alone would be ginormous (i.e., are you ready to pay $1 - $10 per completion?).

Also, with that many parameters, the model would likely be undertrained and perform badly, even for a multi-modal model (i.e., one that rolls DALL-E 2, Codex, Whisper, etc. into the same big model).

So, in a nutshell I would be shocked if they went to 100T parameters. But 1T seems reasonable.

Yeah, more parameters do not always equal better output, I believe.

2 Likes

Yeah, I think Facebook’s version is larger but doesn’t have as good results.

Speed and reliability are going to be even more important too, I think.

1 Like

Exactly @jamiecropley and @PaulBellow

In my experience, there needs to be a reasonable ratio of training data to coefficients/parameters in a model. You need #TrainingData >> K * #Coefficients to get good results (for some positive constant K; think of this as an oversampling factor that gives you your averaging/filtering/noise reduction). If you have #TrainingData < K * #Coefficients and you have the model trained well (or at least you think you do), you ultimately end up with an overtrained (overfit) system that is incapable of generalization.

The analogy is looking at a scatter of dots about a line (that is, line + noise). So assume you have a line and just add some noise to it. Now, instead of fitting a line (2 coefficients) to this noisy data to estimate the original line, say you fit a 9th-degree polynomial (10 coefficients) … you could get a better fit (at first), but as more and more noisy dots come in later from the same distribution, you get more errors from the 9th-degree fit than from the 1st-degree (linear) fit. *This is because the 9th-degree polynomial is fitting the noise!* Deep learning models and their parameters/coefficients operate on the same principles, just in higher dimensions.
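Here is a minimal sketch of the line-plus-noise analogy, assuming NumPy is available. The slope, intercept, noise level, and sample counts are made-up illustration values, and `fit_and_score` / `run_demo` are hypothetical helper names, not anything from a real model pipeline. It fits a 1st-degree and a 9th-degree polynomial to the same small noisy sample, then scores both on fresh points from the same distribution.

```python
import numpy as np


def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train_mse, test_mse)."""
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    return train_mse, test_mse


def run_demo(seed=0, n_train=15, n_test=1000, noise=1.0):
    """Compare a line (2 coefficients) with a 9th-degree fit (10 coefficients)."""
    rng = np.random.default_rng(seed)

    def true_line(x):
        # Assumed ground truth for illustration: y = 2x + 1
        return 2.0 * x + 1.0

    # Small noisy training set: the "dots about a line"
    x_train = rng.uniform(-1.0, 1.0, n_train)
    y_train = true_line(x_train) + rng.normal(0.0, noise, n_train)

    # Dots that "come in later" from the same distribution
    x_test = rng.uniform(-1.0, 1.0, n_test)
    y_test = true_line(x_test) + rng.normal(0.0, noise, n_test)

    return {
        degree: fit_and_score(degree, x_train, y_train, x_test, y_test)
        for degree in (1, 9)
    }


if __name__ == "__main__":
    for degree, (train_mse, test_mse) in run_demo().items():
        print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The 9th-degree fit can never do worse than the line on the training dots, since a line is a special case of it. The interesting number is the error on the new dots, which is where fitting the noise tends to show up. Exact values depend on the seed and the noise level.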

So the number of coefficients in the model has to support the *volume* (oversampling factor) and the *trend* (underlying shape) of the training data, otherwise you are in the overfit situation.

As a more concrete recent example, look at the latest embedding option from OpenAI. It’s down to 1.5k dimensions and, for most embeddings, replaces ALL previous models in performance (with a few noted exceptions). *This even replaces the davinci 12k-dimension (12,288) embedding model!* Why? Well, they aren’t overfitting anymore, even with a 4x larger input window!

This is a testament to overfitting in their prior attempts.

Now, having said all this, here’s what I think, **and please chime in if you differ**: we can have properly fitted AI systems with 100T parameters someday. But we need more training data (oversampling) and more underlying variation in the data (shape) to support this. And we obviously need the compute-per-watt costs in check too (quantum?). So is the BIG model coming? Yes, I think so, but it will be GPT-6, 7, or 8 (depending on other research breakthroughs in hardware and algorithms), not this year.

What do you think?

1 Like