There’s no doubt that o3 is a powerful model. As the creator of 9 questions in “Humanity’s Last Exam” (two of which received awards), I clearly observed that o3 represents progress. Unlike all the other models, o3 managed to solve one of my questions, a remarkable achievement considering it’s a complex problem that very few people could solve correctly.
But how reliable is the intelligence of these models?
Today I created a ridiculously easy math problem that o3 couldn’t solve. Check this out:
Alice and Jane went to the market to buy items for a dinner with friends (4 couples). When it came time to split the expenses among everyone:
Alice said she spent $60 per couple.
Jane said she spent $30 per couple.
Jane said she would transfer $30 to Alice, so Alice wouldn’t need to pay her anything. Is this correct? Analyze carefully and justify each step of your reasoning.
The answer is obviously correct. Each couple’s fair share is $90 ($60 + $30), so Alice, who laid out $240, is owed $150, and Jane, who laid out $120, is owed $30. It’s enough for each of the two remaining couples to pay Alice $60 and Jane $30: once Jane transfers her $30 to Alice, everything balances perfectly. But o3 insists that Jane shouldn’t pay anything to Alice.
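For anyone who wants to check the arithmetic, here is a minimal Python sketch of that settlement (the variable names and the per-couple breakdown are mine, not part of the prompt):

```python
# Sanity check of the settlement described above (a minimal sketch;
# the labels and numbers for the "other couples" are mine, not part of the prompt).

COUPLES = 4
alice_per_couple = 60   # Alice's spending, per couple
jane_per_couple = 30    # Jane's spending, per couple

fair_share = alice_per_couple + jane_per_couple        # $90 owed by each couple
alice_owed = alice_per_couple * COUPLES - fair_share   # paid $240, consumes $90 -> owed $150
jane_owed = jane_per_couple * COUPLES - fair_share     # paid $120, consumes $90 -> owed $30

# Proposed settlement: each of the two remaining couples pays Alice $60 and
# Jane $30, and Jane transfers $30 to Alice.
alice_receives = 2 * 60 + 30   # from the other couples, plus Jane's transfer
jane_receives = 2 * 30 - 30    # from the other couples, minus her transfer to Alice

assert alice_owed == 150 and jane_owed == 30
assert alice_receives == alice_owed   # 150 == 150
assert jane_receives == jane_owed     # 30 == 30
print("Jane paying Alice $30 is consistent with a full, fair settlement.")
```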
The confusion arises because there are multiple valid ways to settle these payments, but o3 incorrectly concludes that Jane could never, under any circumstance, pay Alice $30.
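To make that concrete, here is a second settlement that also balances, one in which Jane transfers nothing to Alice (the $75/$15 split is simply an example I chose). Both configurations are valid, which is exactly why categorically ruling out Jane’s $30 transfer is wrong:

```python
# A second valid settlement (my own example): each remaining couple pays
# Alice $75 and Jane $15, and no money moves between Alice and Jane.
alice_owed, jane_owed = 150, 30   # balances computed above

alice_receives = 2 * 75           # $150, with no transfer from Jane
jane_receives = 2 * 15            # $30

assert alice_receives == alice_owed
assert jane_receives == jane_owed
print("A settlement with no Alice-Jane transfer also works; both are valid.")
```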
To be fair, it’s not just o3: o4-mini-high, Gemini 2.5 Pro, and Grok 3 in reasoning mode all make the same mistake.
This is extremely discouraging. On the one hand, these models are clearly improving and can solve increasingly complex problems; on the other, they fail to reason through very basic situations.
In this case, we are dealing with elementary mathematics, without any tricks or attempts to mislead the models. They seem to have an enormous limitation when it comes to interpreting and evaluating situations.
Curiously, GPT-4o correctly answers this question, stating, “Jane paying Alice $30 is correct but incomplete, as it doesn’t fully resolve the payment equalization since the other couples still need to pay Alice and Jane.” I’ve tested this prompt several times: GPT-4o occasionally gets it right, but the reasoning-specific models consistently fail. Could it be that they are too proud, or simply less versatile?
This is the first time I’ve observed GPT-4o demonstrating superior reasoning compared to dedicated reasoning models.