i’ve tested out the legendary ‘is 9.11 bigger than 9.9’.

and somehow, GPT-4o mini is dumber, and got stubborn about it.

Here are some results from our latest evals:

- **Data Extraction**: GPT-4o Mini performs worse than GPT-3.5 Turbo and Claude 3 Haiku, sometimes missing the mark entirely. None of the models is accurate enough for this task (only 60-70% accuracy).
- **Classification**: Highest precision for GPT-4o (88.89%), making it the best choice to avoid false positives. GPT-4o Mini and GPT-3.5 Turbo have similar, balanced F1 scores.
- **Verbal Reasoning**: GPT-4o Mini outperforms the other models. It doesn’t do well on numerical questions but performs well on relationship and language-specific ones.

More info here: GPT-4o Mini vs Claude 3 Haiku vs GPT-3.5 Turbo

How do y’all even have access to GPT-4o mini??? I don’t even see any option to select it!!!

If you upgrade to Plus you get a selector with a bunch of different models.

But I thought GPT-4o mini was supposed to be free for everyone?

No, you can use a limited GPT-4o version though.

gpt gaslighting you is funny

If you input “9.11 > 9.9” instead of asking “is 9.11 greater than 9.9”, 4o fails as well. Claude and Gemini pass.

4o = “Yes, 9.11 is greater than 9.9.”

4o-mini = “Yes, 9.11 is greater than 9.9. The comparison is straightforward because 9.11 is numerically higher than 9.9.”

Gemini = “No, 9.11 is not greater than 9.9. In fact, 9.11 is less than 9.9.”

Claude = “Since 1 < 9 in the tenths place, we can immediately conclude that 9.11 is less than 9.9. Therefore, the statement 9.11 > 9.9 is false.”
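The place-value reasoning in Claude's answer can be written out as a rough Python sketch. The helper name `compare_decimals` is made up for illustration, and it assumes plain non-negative inputs like "9.11" with no sign or exponent:

```python
def compare_decimals(a: str, b: str) -> int:
    """Return -1, 0, or 1 for a < b, a == b, a > b (non-negative decimal strings)."""
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")

    # Compare the whole-number parts first.
    if int(a_int) != int(b_int):
        return -1 if int(a_int) < int(b_int) else 1

    # Pad the shorter fraction with trailing zeros, so 9.9 is read as 9.90.
    width = max(len(a_frac), len(b_frac))
    a_frac = a_frac.ljust(width, "0")
    b_frac = b_frac.ljust(width, "0")

    # Equal-width digit strings compare place by place, tenths first.
    if a_frac == b_frac:
        return 0
    return -1 if a_frac < b_frac else 1


print(compare_decimals("9.11", "9.9"))  # -1, meaning 9.11 < 9.9
```

The padding step is the important part: "11" vs "9" looks bigger only if you forget that 9.9 is 9.90.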

It’s only available to developers using the API.

4o was surprisingly bad at the “9.11 > 9.9” question. I had to tell it to use python to make it change its mind.
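For reference, this is roughly the kind of check you get when the comparison is run in Python instead of answered from memory. It is only a sketch of one way to do it, not necessarily what the model ran; `is_greater` is a made-up helper name, and `Decimal` is used so the values are compared exactly as written:

```python
from decimal import Decimal


def is_greater(left: str, right: str) -> bool:
    """Compare two decimal strings exactly, without float rounding."""
    return Decimal(left) > Decimal(right)


print(is_greater("9.11", "9.9"))  # False
print(is_greater("9.9", "9.11"))  # True
print(9.11 > 9.9)                 # False (plain floats agree here)
```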

4o-mini also got it wrong at first because its reasoning was flawed. It mistook 9.9 for 9.090, and was quick to correct its answer when I pointed that out.

For example, GPT-4o and GPT-4o-mini seem to make mistakes when comparing numbers where the decimal part is 0.11, such as X.11, with numbers like X.9.

This is a relatively niche error, and they don’t seem to make mistakes with other number comparisons.

There may be many other examples of errors if you look for them.

Once they make a mistake, they will continue to make errors when calculating the difference between the numbers. If you ask them to set up an inequality and subtract the same number from both sides to demonstrate that the relation still holds, the answer gradually becomes incoherent.
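For comparison, here is what the correct arithmetic looks like, as a small sketch using Python's `decimal` module. The choice of 9 as the number subtracted from both sides is just an example:

```python
from decimal import Decimal

small = Decimal("9.11")
large = Decimal("9.9")

# The difference is positive, so 9.9 is the larger number.
difference = large - small
print(difference)  # 0.79

# Subtracting the same number from both sides keeps the ordering.
shift = Decimal("9")
print(small - shift, large - shift)      # 0.11 0.9
print((small - shift) < (large - shift)) # True
```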

As the term “niche error” implies, they seem to be able to compare numbers correctly in cases like X.22 versus X.8.
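If anyone wants to probe the X.11 vs X.9 pattern more systematically, a sketch like the following generates prompts together with the correct answers. The function name `make_cases` and the prompt wording are assumptions for illustration; sending the prompts to a model is left out on purpose:

```python
from decimal import Decimal


def make_cases(fraction_pairs, whole_numbers=range(1, 10)):
    """Build (prompt, expected_answer) pairs such as ("9.11 > 9.9", False)."""
    cases = []
    for x in whole_numbers:
        for left_frac, right_frac in fraction_pairs:
            left = f"{x}.{left_frac}"
            right = f"{x}.{right_frac}"
            # Ground truth comes from exact decimal comparison.
            expected = Decimal(left) > Decimal(right)
            cases.append((f"{left} > {right}", expected))
    return cases


# X.11 vs X.9 (the reported failure) and X.22 vs X.8 (reported as fine).
cases = make_cases([("11", "9"), ("22", "8")])
for prompt, expected in cases[:2]:
    print(prompt, "->", expected)
```

Every expected answer in this set is False, so any "Yes" from a model counts as a miss.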

Basically, since LLMs (Large Language Models) are designed to predict the next token and generate natural sentences, it is better not to expect too much from their arithmetic or mathematical abilities.

Nevertheless, it would be better if they could avoid mistakes in basic number comparisons.