Then you’d need to quantify how often, if ever, the new model produces worse outputs than the old one.
Imagine we can strictly quantify a model’s strength.
Then say they have a new model that is ten times better than the old one at everything other than writing salamander-themed haikus, where it’s only half as good.
Should they not release the new model because it produces worse outputs for that narrow use case?
I would argue they should release such a model.
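To make that argument concrete, here’s a minimal back-of-the-envelope sketch in Python. The ten-times-better and half-as-good figures come from the hypothetical above; the usage share for the niche task is an assumed, made-up number, and none of this reflects how OpenAI actually evaluates releases.

```python
# Back-of-the-envelope illustration: expected quality across all usage,
# weighting each task by how often users actually ask for it.
# The niche_usage_share value is hypothetical.

old_quality = 1.0
niche_usage_share = 0.0001  # assume 0.01% of prompts are salamander-themed haikus

new_quality_general = 10.0 * old_quality  # ten times better at everything else
new_quality_niche = 0.5 * old_quality     # half as good at the niche task

expected_new = (
    (1 - niche_usage_share) * new_quality_general
    + niche_usage_share * new_quality_niche
)

print(f"Expected quality, old model: {old_quality:.4f}")
print(f"Expected quality, new model: {expected_new:.4f}")
# Even though one narrow use case got worse, the expected quality
# across the whole user population is still nearly ten times higher.
```

Under those (made-up) weights, releasing the new model is clearly the right call in aggregate, even though a specific user with that specific niche need is worse off.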
OpenAI, with very few exceptions, knows everything their GPT models have been and are being prompted for. They are building models which they hope are generally better overall, but which are especially better at most of the things most of their users want to do most of the time.
Unfortunately, that sometimes means that if you really need or want to do something not very many people are doing, a newer model might not be as strong as an older one for that particular thing you want to do…
Sometimes, the goal might be a model that meets most people’s needs but is much smaller and more efficient, so it can do 90% of what they need for half the cost.
The end-goal is better models for everyone, but there are many competing needs and motivations, so the path there is unlikely to be strictly increasing all the time for everyone.