o3 can't solve this ridiculously easy problem. What kind of intelligence is that?

There’s no doubt that o3 is a powerful model. As the creator of 9 questions in “Humanity’s Last Exam” (two of which received awards), I clearly observed that o3 represents progress. Unlike all the other models, o3 managed to solve one of my questions, a remarkable achievement considering it’s a complex problem that very few people could solve correctly.

But how reliable is the intelligence of these models?

Today I created a ridiculously easy math problem that o3 couldn’t solve. Check this out:

Alice and Jane went to the market to buy items for a dinner with friends (4 couples). When it came time to split the expenses among everyone:

Alice said she spent $60 per couple.
Jane said she spent $30 per couple.

Jane said she would transfer $30 to Alice, so Alice wouldn’t need to pay her anything. Is this correct? Analyze carefully and justify each step of your reasoning.

The answer is obviously correct because it’s enough for the other couples to each pay Alice $60 and Jane $30, balancing everything perfectly when Jane pays Alice $30. But o3 insists that Jane shouldn’t pay anything to Alice.

The confusion arises because there are multiple configurations and possible solutions for these payments, but o3 incorrectly concludes that Jane could never, under any circumstance, pay Alice $30.
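
For anyone who wants to check the arithmetic, here is a minimal Python sketch of that configuration, assuming the four couples include Alice’s and Jane’s (so two other “guest” couples) and that accounts are settled per couple:

total = 240 + 120            # Alice: $60 x 4 couples; Jane: $30 x 4 couples
share = total / 4            # fair cost per couple: 90.0
alice = 240 - 2 * 60 - 30    # Alice's couple: 240 spent, minus 60 from each guest couple, minus 30 from Jane
jane = 120 - 2 * 30 + 30     # Jane's couple: 120 spent, minus 30 from each guest couple, plus 30 paid to Alice
guest = 60 + 30              # each guest couple pays 60 to Alice and 30 to Jane
print(alice, jane, guest, share)   # 90 90 90 90.0 -> every couple ends up paying exactly one share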

To be fair, it’s not just o3; o4-mini-high, Gemini Pro-2.5, and Grok 3 reasoning models all make the same mistake.

This is extremely discouraging. On one hand, these models are clearly improving and capable of solving increasingly complex problems. Yet, on the other hand, they fail at reasoning through very basic situations.

In this case, we are dealing with elementary mathematics, without any tricks or attempts to mislead the models. They seem to have an enormous limitation when it comes to interpreting and evaluating situations.

Curiously, GPT-4o correctly answers this question, stating, “Jane paying Alice $30 is correct but incomplete, as it doesn’t fully resolve the payment equalization since the other couples still need to pay Alice and Jane.” I’ve tested this prompt several times; GPT-4o occasionally gets it right, but the reasoning-specific models consistently fail. Could they be too proud, or perhaps just less versatile?

This is the first time I’ve observed GPT-4o demonstrating superior reasoning compared to dedicated reasoning models.

3 Likes

I use ChatGPT mostly for academic/historical research and the latest batch of reasoning models (o3/o4-mini/o4-mini-high) have been giving me ridiculous hallucinations. For prompts like “Could you please provide a timeline of the historiography of the French Revolution?” they create nonexistent facts and fictitious author/book titles. May be good for coding or benchmarks, but for general knowledge and basic facts you’re on your own.

1 Like

Sometimes the input pattern just isn’t what the model expects, or it involves temporal events that the AI can’t resolve correctly. These can seem simple to us.

I just encountered this issue in a task where I wanted an AI to do my thinking for me. I then stripped the system message and context all the way down to just an identity and the user’s Python version, plus one question with an example of the alignment I wanted, to reproduce the same issue. Let’s try it across models. top_p: 0.01

What I want: aligned fractional part

GPT-4.5

Fails by initially writing one usage, then self-correcting with another that uses excessive space.

o4-mini

Doesn’t deliver the desired output.

GPT-4-0314

How far does the wrong idea go back?

o1 @ medium

o1 Pro spent more time thinking than it would have taken me to try some values myself (which is what I did). It doesn’t offer a great explanation, but many models will prefer to give only the patched line.

Use a fixed field width large enough for three digits before the decimal point plus the decimal and its fifteen fractional digits (total of 19). For example:

print(f'  Probability: {top["prob"] * 100:19.15f}%\n')

o3-mini @ medium

The most correct and satisfactory, except for non-markdown indented output using multibyte code points for a rendering environment that was never documented; o3 (full) at medium also produces a working line.
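
For reference, a minimal sketch of the aligned output that was the goal here (the values and field width are illustrative, not from the original task): pad to a fixed field width so the decimal points line up down the column.

values = [50.0, 12.25, 3.141592653589793]
for v in values:
    # a total field width of 19 holds three integer digits, the decimal point, and fifteen decimals
    print(f'Probability: {v:19.15f}%')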

1 Like

Perhaps I’m as unintelligent as these models, but what I would compute is:

Dinner for 4 couples.
2 people spending, one at 60/ea-couple and one at 30/ea-couple.
(60 * 4) + (30 * 4) = 240 + 120 = 360 total spent

Of the total spent, Alice spent 240 and Jane spent 120.

IF you presume that you ARE clearly indicating “that the expenses must be split among everyone”, THEN you have to consider:

Do you mean everyone individually, indicating 8 unique individuals splitting all expenses?

Do you mean “everyone” as in each pair (couple) of people, meaning 4 unique parties split all expenses?

  • In the first case of an 8-way split, that means 360 / 8 = $45 each
    • which means Alice has spent $240 but her share was only $45 of the grand total, indicating that:
      • each person, except Jane, should pay Alice (240 - 45 = 195) / 6 (people who haven’t paid anything yet) = $32.50
    • Jane already spent $120 total, and her share of the original total was also $45,
      • thus everyone else would owe her (120 - 45 = 75) / 6 = $12.50.

Thus, it is incorrect that Jane would pay Alice anything, in the case of the 8 individuals splitting a total of $360 spent on dinner “among everyone”.

Because Jane already paid more than her share of the total, and so did Alice. So everyone else owes both of them, but neither of them owes the other anything.

Can’t see any other way around it on my end. I didn’t use the LLM.
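
If it helps, a quick Python sketch of that 8-way reading:

total = 240 + 120
share = total / 8                            # 45.0 per person
owed_to_alice = 240 - share                  # 195.0, to be covered by the six people who paid nothing
owed_to_jane = 120 - share                   # 75.0
print(owed_to_alice / 6, owed_to_jane / 6)   # 32.5 12.5 owed to Alice and Jane per non-paying person
# under this reading, neither Alice nor Jane owes the other anything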

@natanael.wf

1 Like

Yeah, I was on the same track; there’s literally nothing that is owed to Jane at all in the prompt, and the AI is correctly solving the math behind that.

If OP adjusted the temperature of the response, we might see an AI joke about the one girl guilting the other but not being willing to vocalize it.

1 Like

Just my two cents: maybe the reasoning models see a misalignment in how the expenses are spread among all 4 couples, while 4o, being more empathetic, focuses on the relationship between Alice and Jane.

1 Like

I’m Japanese, so I may not be understanding it correctly.

Alice and Jane are also considered one couple (a 5th couple).
That’s $45 per person.

  • Alice spent $30 for Jane.
  • Jane spent $15 for Alice.

So, does that mean Jane has to pay $15 to Alice?
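
A tiny Python sketch of that reading, settling only inside the Alice-Jane “couple”:

alice_paid_for_jane = 60 / 2   # half of Alice's $60-per-couple spending covers Jane
jane_paid_for_alice = 30 / 2   # half of Jane's $30-per-couple spending covers Alice
print(alice_paid_for_jane - jane_paid_for_alice)   # 15.0 -> Jane transfers $15 to Alice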

And I think the way to solve LLMs’ problem of not handling numbers well is to treat each number as a single token and have them learn all kinds of calculations.

I don’t know about the cost.

1 Like

Well as a “not so gifted human”, it took me a while to understand this “simple” problem.

For those like me who didn’t follow, the proposed answer by OP is:

  • Alice paid 240, Jane 120, total 360 = 90 per couple
  • the other 2 couples each pay back 60 to Alice (= 120) and 30 to Jane (= 60)
  • balance for Alice is 240 - 120 - 90 (her share) = 30
  • balance for Jane is 120 - 60 - 90 (her share) = -30
  • so, Jane gives Alice 30 and it is now even
  • would I reach this conclusion without the explanation? probably not.

What would some people normally do without thinking (lazy and unpredictable):

  • total is 360 = 90 per couple
  • the other 2 couples would each need to pay 90. Alice and Jane would probably say it’s OK to pay either of them, or would designate one of them to receive the money.
  • It would be “unusual” to ask the guests to make 2 separate payments.
  • Alice and Jane would deal with the difference themselves later

The most rational solution would be:

  • Alice paid 240, “owes” 90, needs to receive 150 = 2 x 75
  • Jane paid 120, “owes” 90, needs to receive 30 = 2 x 15
  • all payments are made without second stage adjustments
  • But I doubt people would do this math in real life.
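
A small Python sketch of that direct settlement (per-couple reading, two guest couples, no second-stage transfer):

total = 240 + 120
share = total / 4               # 90.0 per couple
to_alice = (240 - share) / 2    # 75.0 from each of the two guest couples
to_jane = (120 - share) / 2     # 15.0 from each of the two guest couples
print(to_alice, to_jane)        # 75.0 15.0, and 75 + 15 = 90 = exactly one couple's share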

It is a clever question, but I think most of “humanity” would fail on this too. :sweat_smile:

1 Like

If the directions and rules are not made clear and free of ambiguity, people will always have wildly different interpretations. The fact that so many people here are not exactly sure what the “problem” is just proves that point. sharakusatoh makes a good point about the fact that Jane and Alice could also constitute a couple. There is no clear instruction as to why anyone would owe anyone else anything in this situation unless the directions explicitly state so. The problem is not even asking you to figure out the total of what each person paid per couple. You could do that, sure, but why? All the problem ends with is this question: Is this correct? Is what correct? All you have here is two people who went shopping. One paid more than the other. No agreement was made in the problem about splitting the bill in the future or with the other couples. This is why our greatest weakness as humans lies in using our greatest tool: language. If we cannot fully and unambiguously articulate our ideas, there will always be room for interpretation and misunderstandings like this one.

The whole dispute is a language glitch, not a math error. Lucid.dev quietly models eight separate diners, sharakusatoh folds Alice and Jane into a fifth couple, and the original poster keeps three guest couples plus two hosts, so each faction runs flawless arithmetic inside its own frame and lands on a different transfer. The prompt never pins down what “split among everyone” actually means, so every reader fills the blank with a private idea of fairness, declares the result obvious, and talks past the rest. The fix is to make the premise explicit: state whether we are splitting by person or by couple, and who counts as a couple. Then the numbers collapse to a single answer and the argument evaporates. Until we do that, we will keep blaming logic for a missing definition and the model for mirroring our ambiguity.

Funny enough, people still provide answers because they naturally assume the unstated premise is obvious and shared by everyone involved. The prompt looks like a straightforward math puzzle, signaling the mind to apply a familiar rule and quickly reach closure. In everyday conversation, missing details are usually treated as implicitly understood or unimportant, so responders feel no need to clarify before proceeding. As a result, multiple internally consistent yet mutually incompatible answers appear, each perfectly logical within its own hidden framework, unaware that the true disagreement lies entirely in how the original instruction was silently interpreted. And you know what they say when you assume…

For me, what’s more interesting is figuring out where the LLM made a mistake. Then, just by adding something like “Haven’t you seen ‘When it came time to split the expenses among everyone’? The costs were split AMONG EVERYONE!”, it came up with the right answer.
That’s actually pretty much how I use LLMs in my business too. They save me a ton of time, but sometimes I still have to steer them a bit.

2 Likes

Though, when we figure out where the LLM made a mistake, we might just notice that we were not explicit enough in our request; that has been my biggest self-discovery. If there is any room for interpretation, the machine will make a guess, a calculated guess, but a guess nonetheless. It will try to figure out what it thinks we mean because we haven’t plugged all the holes. We must express our problems and, more importantly, our expectations as unambiguously as possible to avoid confusion with machines and humans alike. That is why iteration is the key to working with people and AI. If the answer did not please you the first time, it just means you need to figure out why that answer was provided by asking the right questions.

I agree that the problem is not formally stated with all details and constraints, and I agree your solution is a valid option for the problem.

However, the model’s answer differed from yours. Notice that your solution first identifies possibilities and then formulates an answer. This is precisely what the model failed to do, and my critique lies exactly at this point.

The model interpreted the payment as a couple-based payment, not individual (which is fine; it would have been better if it had identified both possibilities, but I cannot call it an error since I did mention couple expenditures). The issue arises when the model calculates that Jane has already paid $120, and therefore owes nothing to Alice, making it incorrect for her to pay Alice $30.

In the configuration where the other two couples each pay Alice $60 and Jane $30, and Jane pays Alice $30, the numbers add up. It’s a valid distribution.

It might not be the most optimized in terms of transfers, and certainly not the only way to settle accounts. However, claiming it is incorrect simply because Jane has already paid $120 and thus owes nothing to Alice (in the context of settling accounts between couples, not individuals) is definitely wrong. The model did not recognize this configuration as possible.

When I introduce the phrase “Other couples still need to pay both Alice and Jane” in the original problem, the model identifies it as correct. This clearly demonstrates that the model failed to consider this scenario.

This problem is interesting because it highlights a weakness I’ve observed in other situations: reasoning models are weak at considering multiple possible solution hypotheses and configurations. A question I submitted to Humanity’s Last Exam (id 66ec8ab4b489956467209e0c) reveals this exact weakness.

A good answer for this couples’ expenditure problem would identify that multiple interpretations exist and calculate if there’s a configuration where Jane paying Alice $30 is valid, clearly explaining the conditions required. No model was able to do this, something I consider simple and fundamental in terms of intelligence.

Since OpenAI frequently highlights that their models have “PhD-level intelligence”, my critique is that a researcher definitely possesses this capability, while these models are still significantly lacking in this respect.

You’re right, natanael.wf, that ideally, the model would explicitly identify multiple possible interpretations and clearly state the conditions under which each interpretation holds. My point was that the initial prompt itself was not explicit enough to reliably yield such nuanced reasoning. Without explicitly clarifying if the expenses are split individually or per couple, the model inevitably chooses one interpretation over the other and appears to “fail” when another valid scenario is considered. This ambiguity in the original language is precisely why the model struggles: it mirrors our own uncertainty. A truly robust reasoning model should indeed recognize multiple scenarios, but to consistently expect such nuanced handling, our instructions must also become clearer and explicitly signal that multiple interpretations exist.

You didn’t quite get it there…

Your “most rational solution” is correct - what you wrote about “the proposed answer by OP” is incorrect. Here is where your justification of OP’s answer went awry:

Alice already paid her share, and so did Jane. If you presume we are doing it by couples instead of individuals, and that Alice and Jane each represent one couple, there are two couples remaining.

That means Alice is simply owed 240 - her share (90) = owed 150.

It also means Jane is simply owed 120 - her share (90) = owed 30.

Neither Alice nor Jane could owe each other anything - they already paid in full.

Thus the other two couples owe them each $75 and $15 respectively. There is none of this as you stated:

"balance for Alice is 240 - 120 - 90 = 30. "

The other couples don’t pay back Alice 120. They pay her back what they owe - 150!

You stated they pay back 60 to Alice and 30 to Jane - but that’s incorrect. It’s not split into two thirds and one third.

This is because the original amounts paid appear to be 2/3 and 1/3 of the total, but that ratio is misleading because you haven’t yet subtracted their own shares FIRST.

So the only issue with the first part of your math that appeared to “justify OP’s answer” was that you did not FIRST subtract their own share from their original payments.

If you do, then you see that what you must do the math on is the OVERPAY, which is:

  • Alice = $150
  • Jane = $30

Now you can properly do the ratios, but you can’t do them on the original payment amounts of $240/$120, because you have not yet subtracted each payer’s share of the grand total from what she paid, which is what determines what she is owed by the other participants.

GPT is definitely smart enough to figure this out.

1 Like

Yeah, probably most people would not be able to perceive all possible solutions and provide a correct answer to this problem. My motivation for calling this problem “ridiculously easy” is that it is much easier than other problems o3 is capable of solving, problems which have led OpenAI to claim that o3 possesses “PhD-level intelligence.” Sorry for not making that clear.

1 Like

Hey Natanael, unfortunately for this problem there is only one solution, not multiple solutions.

Also, models are trained to respond to the context you provide. If you say “analyze this math question and give possible results and also issues with the question being asked”, the model will do that. If you just give them the question, they will “simulate” just responding to that question - not necessarily analyzing it or showing you what’s wrong with it.

As you state here

There is no “configuration where the other couples each pay Alice $60 and Jane $30”. That was totally inaccurate math based on failure to use logical order of operations in solving the problem. See the reply I just made to aprendendo.next a couple of minutes ago.

There IS only one valid solution for this problem. The model’s response was exact, accurate, and to the point - Jane already paid $120 and therefore owes nothing to Alice.

In many cases this IS THE DESIRED BEHAVIOR - a short and sweet answer - like you would expect from a student on a multiple-choice test. You wouldn’t expect the student to give you a soliloquy about why your proposed answer is wrong and outline all possible scenarios - you expect your student to just say “nope, you’re wrong - obvious reason - Jane already paid more than her share - case closed”.

It’s not a complex question requiring a complex response.

Try giving it an actually difficult question!

Then you might get a complex response.

If you want analysis of a very simple math problem like that - give it the prompt to provide analysis.

Otherwise, be content with a very simple response to a very simple question - one that is accurate!

I fully agree. My point is that the models fail precisely in considering multiple possibilities. The model’s response is not just “incorrect”; it provides its reasoning, as I requested. And that reasoning is incorrect because it only considers one possibility to reach its conclusion, leading to flawed argumentation. This issue reveals the model’s inability in terms of analysis and interpretation before attempting to solve the problem.

1 Like

I just always remind myself that it does not know what I mean unless I am as clear as crystal. I think most PhDs would have a problem with this as well. Most people are accustomed to only being asked to provide a single answer; it seems to be the favored human default lol. You are correct in that it should have stopped you and said “Well, can we clear some of this up?” That is when the research tool might be helpful, because it actually does one of these clarity and ambiguity checks before going to work for you to solve your problem. This is not your mistake; it is just the current capability of the chatbot. I think my main point is that we must always be explicit in our articulation of questions and answers, painfully so. We need to spell out all the rules and parameters and, most importantly, our expectations. If the machine does not know what we really want, it will guess. In this case the guess was not wrong, it was just incomplete and revealed a directional hole of sorts.

1 Like

I get it now.

It’s not really about it being correct, but about it being capable of raising new hypotheses that weren’t openly stated, something that an average Joe wouldn’t think of but a “researcher” eventually would.

Well, it was an interesting debate anyway. Thanks for sharing it with us!

1 Like

Why complain and say what the model “should have done”?

Ever considered that your version of “what it should have done” might simply not match “what the people who built it think it should do”?

Are they right and you’re wrong, or vice versa? Why the binary perspective on such things?

How about: they made it → I use it → if I want a different one I’ll make my own → or I’ll learn what they actually built and adjust my own usage of it to get the results I want based on what was actually built by those who created it → instead of complaining that it’s not exactly matched to my perspective.

It’s a lot easier to adjust your perspective than it is to adjust a multi-trillion-parameter LLM!