What criteria are used to determine that newer models are "better"?

simonchatgpt3 · November 17, 2023, 9:58am

Hello,
I am interested in the grammar of LLM-written texts and have compared some newer models, like GPT4_Turbo, against older models, like Text-davinci_003.

I have found some remarkable differences in the grammar between the model versions. What is normally compared, if not the grammar, when it is said that the newer model is more capable?

Is there literature that compares model versions based on grammar or other criteria?

Thanks for helping me. (Perhaps something interesting will come out of my grammatical analysis.)

trenton.dambrowitz · November 17, 2023, 10:16am

At the moment, the best way seems to be to use them. Everyone uses these tools a little bit differently, so what’s perfect for one person may underperform for another.

That being said, there are various benchmarks and evaluations that have been made in an attempt to rank LLMs. Below are a few from when GPT-4 was first released.

Each benchmark grades different capabilities (or at least attempts to), so it’s a bit difficult to directly compare models.

Here is a link to a blog post I found after a quick search that discusses GPT4 Turbo.
