What criteria are used to determine that newer models are "better"?

I am interested in the grammar of LLM-written texts and have compared some newer models, like GPT-4 Turbo, against older models, like text-davinci-003.

I have found some remarkable differences in the grammar between the model versions. What is normally compared, if not the grammar, when it is said that the newer model is more capable?

Is there literature that compares model versions based on grammar or other criteria?

Thanks for helping me. (Perhaps something interesting will come out of my grammatical analysis.)

At the moment the best way seems to be to use them yourself. Everyone uses these tools a little bit differently, so what’s perfect for one person may underperform for another.

That being said, there are various benchmarks and evaluations that attempt to rank LLMs. Below are a few from when GPT-4 was first released.

Each benchmark grades different capabilities (or at least attempts to), so it’s a bit difficult to directly compare models.
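To make that concrete, here is a rough sketch of why per-benchmark scores don't always give a single ranking. The model names, benchmark names, and numbers are all made up for illustration (they are not real results), and `wins_per_benchmark` and `dominates` are just hypothetical helper names.

```python
# Hypothetical scores (0-100), invented purely for illustration; NOT real results.
scores = {
    "model_a": {"reasoning": 80, "coding": 60, "grammar": 90},
    "model_b": {"reasoning": 70, "coding": 85, "grammar": 88},
}


def wins_per_benchmark(scores):
    """Return, for each benchmark, the name of the highest-scoring model."""
    benchmarks = next(iter(scores.values())).keys()
    return {b: max(scores, key=lambda m: scores[m][b]) for b in benchmarks}


def dominates(scores, a, b):
    """True if model a is at least as good as b everywhere and strictly better somewhere."""
    pairs = [(scores[a][k], scores[b][k]) for k in scores[a]]
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


winners = wins_per_benchmark(scores)
print(winners)

# Neither model beats the other on every benchmark, so "better" depends on
# which capability you care about.
print(dominates(scores, "model_a", "model_b"))
print(dominates(scores, "model_b", "model_a"))
```

With numbers like these, each model wins on some benchmarks and loses on others, so which one counts as "better" depends on how you weight the capabilities for your own use case.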

Here is a link to a blog post I found after a quick search that discusses GPT4 Turbo.