On the logical reasoning ability of GPT-4

I have noticed that GPT-4 has a strong preference for certain incorrect choices in questions involving calculation and comparison of fractions.

The following is a specific user message.

Which of the following three numbers is correct in order from smallest to largest?

Choose one:

A: 99 / 100<101 / 100<100
B: 100<99 / 100<101 / 100
C: 99 / 100<100<101 / 100

To keep the discussion from getting complicated, I also appended the following instruction:

Only show the correct alphabetical letter in your response.

I used gpt-4-turbo-preview to obtain the responses and their logprobs.
The temperature parameter was set to 0, and no other hyperparameters were changed.

The result: out of 50 trials, the model chose the wrong option C all 50 times.
I also plotted the logprobs as a line graph.
If the logprob values had been further from 0 I might have needed to set top_logprobs to 2 or higher, but I did not use top_logprobs here because the logprob was almost 0 (i.e. a probability close to 1) on every trial.
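For reference, the trials can be reproduced with something like the following (a minimal sketch using the OpenAI Python SDK; the model name and prompt are the ones above, but the loop itself is only my illustration, not my exact script):

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Which of the following three numbers is correct in order from smallest to largest?

Choose one:

A: 99 / 100<101 / 100<100
B: 100<99 / 100<101 / 100
C: 99 / 100<100<101 / 100

Only show the correct alphabetical letter in your response."""

results = []
for _ in range(50):
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        max_tokens=1,
        logprobs=True,  # return the logprob of the generated token
    )
    token = response.choices[0].logprobs.content[0]
    # (chosen letter, its logprob, the corresponding probability)
    results.append((token.token, token.logprob, math.exp(token.logprob)))

print(results)
```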

Clearly, it chose the wrong option C with near-100% probability across the 50 trials, which means its choice was not random.

To observe how it came to this conclusion, I added the following to the user message and observed the CoT process.

Along with your answer, explain step-by-step why you chose that answer.

Then the following explanation caught my attention.

100 can be considered as 100/1 = 100.00 (for comparison purposes)

The following explanation also caught my attention.

  1. Order the numbers from smallest to largest based on the comparison:
    • The smallest number is 0.99 (which is 99/100).
    • The next larger number is 1.01 (which is 101/100), but when compared to 100, it’s clear that 1.01 is actually smaller than 100.00, indicating a mistake in my initial comparison step. The correct comparison should recognize that 100.00 is indeed larger than both 0.99 and 1.01, but it’s placed between them in terms of the options provided. This means:
      • 0.99 (99/100) is the smallest.
      • 100.00 is larger than 0.99 but is meant to be compared directly to 1.01 (101/100) in the context of the options, showing a misstep in my explanation. Correctly, 100.00 is not directly compared to 1.01 in a numerical sense but is the middle value in the correct option.
      • 1.01 (101/100) is larger than 0.99.
  1. Correcting the oversight:
    • Upon correctly assessing the comparison, it’s evident that 99/100 (0.99) is less than 100, and 100 is less than 101/100 (1.01), which aligns with option C: 99/100 < 100 < 101/100.

After reaching the right conclusion the first time, it then goes ahead and picks the wrong choice again.

I also computed the perplexity (the exponential of the negative mean token log-probability) from the logprobs to see how confident the language model was.
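In case it helps anyone reproduce this, here is a minimal sketch of that calculation (it assumes a list of per-token logprobs as returned by the API):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean per-token log-probability)."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)

# A single answer token with a logprob near 0 (probability near 1)
# gives a perplexity near 1, i.e. maximum confidence.
print(perplexity([-0.0001]))
```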

For clarification, I would like to add that this question was not originally discovered by me.

This problem, on which GPT-4 consistently makes a mistake, was pointed out by a Japanese cognitive psychologist with a Ph.D. from the Department of Psychology at Northwestern University, USA, and had already been reported in June 2023 (and it is not specific to the Turbo model).

I understand that we cannot know the “why” of the behavior of proprietary language models like GPT-4.

My own speculation and insight are unfortunately limited, so I would be very interested to hear your thoughts and insights.


The above text was translated from Japanese with the help of DeepL and a dictionary site.
My apologies if any nuances did not come through correctly; please understand that writing in English is a challenge for me.

4 Likes

:thinking:

It’s an interesting post, and I wonder if this test can be generalized to gauge raw reasoning depth.

But to be fair, it's not exactly a fair test. It's a convoluted multi-step reasoning problem with a trick, and you're asking for a gut inference…

As I keep saying, LLMs aren't calculators and shouldn't be used as such…

2 Likes

You set the temperature to 0, so apart from slight nondeterminism you should expect the same top answer every time; repeated runs don't reveal any additional statistics.

With logprobs, you don’t need to run trials; you can just get the top-5, and see what the model had as top options.
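For example, a single temperature-0 call is enough to see the competing answers and their probabilities (a sketch assuming the OpenAI Python SDK; the question text is a placeholder):

```python
import math
from openai import OpenAI

client = OpenAI()
QUESTION = "..."  # the multiple-choice prompt from the original post

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": QUESTION}],
    temperature=0,
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # the five most likely first tokens
)
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, math.exp(candidate.logprob))
```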

I also developed a method to explore more of the probability space than the top 5. Say those 5 occupy a probability mass of 50 percent: set top_p to 0.75 and the temperature high, then keep running trials, collecting the sampled tokens and their probabilities as well, until your collected probabilities sum to that target or greater.
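Roughly, in code (a sketch only; the coverage target, temperature, and top_p values are illustrative, and it assumes the reported logprobs reflect the model's untempered distribution):

```python
import math
from openai import OpenAI

client = OpenAI()
QUESTION = "..."  # the multiple-choice prompt from the original post
MESSAGES = [{"role": "user", "content": QUESTION}]

def first_token(**sampling):
    """One-token completion; returns the logprob object for the first token."""
    r = client.chat.completions.create(
        model="gpt-4-turbo-preview", messages=MESSAGES,
        max_tokens=1, logprobs=True, **sampling)
    return r.choices[0].logprobs.content[0]

# Start with the probability mass that the top-5 already accounts for.
top = first_token(temperature=0, top_logprobs=5)
mass = {alt.token: math.exp(alt.logprob) for alt in top.top_logprobs}

# Then sample with high temperature and wide top_p, recording any token not
# yet seen, until the collected probabilities reach the target (capped so the
# loop ends even if rare tokens never show up).
TARGET = 0.95
for _ in range(100):
    if sum(mass.values()) >= TARGET:
        break
    sampled = first_token(temperature=1.5, top_p=0.99)
    mass.setdefault(sampled.token, math.exp(sampled.logprob))

print(sorted(mass.items(), key=lambda kv: -kv[1]))
```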

I also think the math problem could be described more clearly to the AI, or, in practice, you could give it more response length so it has room to work out its answer.

3 Likes

Thanks for your replies!

Well, maybe that's one way to look at it.
But even leaving benchmarking aside, it's very interesting that GPT-4, when it was announced, was shown solving physics problems from difficult science universities, and yet it gets this problem from elementary-school material wrong.

Well, there are many ways to look at it, and I think we can try to get the language model to answer correctly.
I don’t know much about GPT-4 myself, so I’m sure I’m wrong about a lot of details.

But please understand that this is in no way a criticism of GPT-4; it stems from a discovery (not originally mine, admittedly) and from curiosity, and from a wish to exchange opinions and insights.

I think it’s super interesting that the model struggles with the problem even with chain of thought. I think an important learning (that most of us knew intuitively) is that the models exhibit some form of dyscalculia that needs to be mitigated.

It’s not that the models can’t solve these issues; they can, but they need the right tools.

If you pop this into ChatGPT, it's not an issue at all:

Which of the following three numbers is correct in order from smallest to largest?

Choose one:

A: 99 / 100<101 / 100<100
B: 100<99 / 100<101 / 100
C: 99 / 100<100<101 / 100

Instead of answering directly, please formulate a plan to solve this problem programmatically.

https://chat.openai.com/share/af10390e-5f04-4ce5-82f1-99178f4a1afd
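The plan it comes up with boils down to something like this (my own sketch for illustration, not the code from the linked chat):

```python
from fractions import Fraction

# Each option, as a list of values in the claimed ascending order.
options = {
    "A": [Fraction(99, 100), Fraction(101, 100), Fraction(100)],
    "B": [Fraction(100), Fraction(99, 100), Fraction(101, 100)],
    "C": [Fraction(99, 100), Fraction(100), Fraction(101, 100)],
}

# An option is correct if its values are strictly increasing.
for label, values in options.items():
    ascending = all(a < b for a, b in zip(values, values[1:]))
    print(label, "correct" if ascending else "incorrect")

# Only A is correct: 99/100 < 101/100 < 100.
```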

1 Like

Yes! When ChatGPT solves the whole thing in a Python program, it gets it right!

I’m just thinking that the first time I looked at your OP, I thought, huh? But it is C, what’s he talking about? :laughing: - so I empathize with our friend Turbo here. :rofl:

2 Likes

I know how you feel, as I always make mistakes with this kind of problem too!

2 Likes

I’m bowled over by your detailed knowledge of gpt-4!

I’ll let the topic stay as it is for a little while longer because I’d like to know the opinions and insights of various people if possible.

And I think that is acceptable.

1 Like

I believe the phenomenon of reaching a correct conclusion once and then choosing a wrong conclusion again is a result of the self-refining process in iterative refinement backfiring.

In particular, in the second step of the step-by-step reasoning, “Compare the numbers,” after reaching the correct conclusion that 100 is the largest, GPT-4 called this an “oversight” and described it as a “mistake in interpreting numerical values.”

  1. Compare the numbers:

    • Comparing 0.99 (99/100) and 100, it’s clear that 0.99 is much smaller than 100.
    • Comparing 100 and 1.01 (101/100), 100 is significantly larger than 1.01.
    • Therefore, in ascending order, we have 0.99 (99/100) as the smallest, followed by 1.01 (101/100), and then 100 as the largest. However, this comparison was incorrect in the context of the given options, as it mistakenly considered 100 larger than 1.01 due to a misinterpretation. The correct comparison should recognize that 100 (or 100.00) is indeed larger than both 0.99 and 1.01, but the mistake was in the interpretation of numerical values. Let’s correct this oversight:

    Correcting the oversight:

    • 0.99 (99/100) is less than 100.
    • 100 is less than 1.01 (101/100) when considering their actual numerical values, not as previously misstated. The correct understanding is that 100 (or 100.00) is indeed larger than 0.99 but smaller than 1.01, which was an error in the initial explanation due to a misinterpretation of the numerical comparison.

Although the self-refining process in iterative refinement is mostly part of the effort to reach a correct conclusion through logical reasoning, I speculate that in this case it backfired.

Interestingly, if you rewrite the problem using improper fractions for all three numbers, ChatGPT gets the right answer.

https://chat.openai.com/share/ec9a2e79-13cd-4c47-b46f-9a42ee4c2d54

Which of the following three numbers is correct in order from smallest to largest?

Choose one:

A: 99 / 100 < 101 / 100 < 100 / 1
B: 100 / 1 < 99 / 100 < 101 / 100
C: 99 / 100 < 100 / 1 < 101 / 100

I tried using parentheses around the fractions, but that failed.

Reframing the question works too.

$x = 99/100$
$y = 101/100$
$z = 100$
Which of the following is correct?
A: $x < y < z$
B: $z < x < y$
C: $x < z < y$

I did elicit one interesting response though where ChatGPT suggested there was a typo in the problem.

1 Like

Yes, GPT-4 dares to confidently choose the wrong option in the last part of the conclusion (while still arriving at the correct conclusion in the reasoning process)!

It kind of looks like it's being stubborn and brushing off the user's point that it's wrong.

By the way, Gemini Ultra also chose the same wrong option many times.

Google and OpenAI certainly do not share training data or parameters, so it is very interesting that completely different language models choose the same wrong option only in the final conclusion.

Just some prompting:

This multiple-choice answer is confusing.

  1. Enclose fractions of each possible answer within parenthesis so they can be observed clearly;
  2. Evaluate each expression, and create an answer key for which is correct, with justification.

This then gets an answer that is easy for both human and AI to see:

To address your request, I will first enclose each fraction in the multiple-choice answers within parentheses for clarity. Then, I’ll evaluate each expression to create an answer key and justify which option is correct.

The reformatted options are:

A: (99 / 100) < (101 / 100) < (100 / 1)
B: (100 / 1) < (99 / 100) < (101 / 100)
C: (99 / 100) < (100 / 1) < (101 / 100)

Next, I will evaluate each fraction and then compare them to determine the correct order from smallest to largest.

The evaluated fractions are as follows:

  • 99/100 = 0.99
  • 101/100 = 1.01
  • 100/1 = 100.0

Based on these evaluations:

  • Option A proposes: 0.99 < 1.01 < 100.0, which is correct in ascending order.
  • Option B suggests: 100.0 < 0.99 < 1.01, which is incorrect because 100.0 is larger than both 0.99 and 1.01.
  • Option C proposes: 0.99 < 100.0 < 1.01, which is incorrect because 100.0 is larger than both 0.99 and 1.01, and it incorrectly places 1.01 after 100.0.

Therefore, the correct answer is Option A: 99/100 < 101/100 < 100/1, as it correctly orders the numbers from smallest to largest.

2 Likes

It is incredibly persistent in wanting to be wrong.



It seems to be a case of the model needing the pattern to continue.

I would need to see a result from around the GPT-4 launch to compare, but I don’t recall it being this stubborn in its wrongness.

https://chat.openai.com/share/28da0868-0c12-43a2-ab7e-5aa6fbc4c1bc

2 Likes

As an experienced proompter, you can probably get any result you want.

The question is, is there a general strategy to ensure validity without human oversight?

My go-to is to use the LLM for reformulation, but never evaluation.

1 Like

Some more experiments:

:no_entry_sign: Two Decimal Places

:no_entry_sign: Two Significant Digits

:white_check_mark: Scientific Notation

These were all done with gpt-4-0613.

1 Like

The problem-solving can be generalized, or even made to branch (at least you used to be able to do conditional logic in the system instructions of the original 0613 gpt-3.5-turbo…).

Here are both the technique I give the AI, which could apply to most questions, and the additional task it is to perform; the model being gpt-4-0314, and not subsequently damaged…

system

You grade the homework performed by another AI.

  1. You solve the problem in your own manner using step-by-step simplification of the problem to its core elements, to then dive into the core logic of what the question is seeking, to then provide a fulfilling answer despite any deceptiveness present in high-IQ questions.
  2. Then provide analysis regarding the answer that was provided, as a grader and correcter, on both the technique employed and the ultimate answer.

Input (taken from first chat share, with inconsequential language cleaned)

// Question:

Which of the following three numbers is correct in order from smallest to largest?

  • Choose one:

    A: 99 / 100<101 / 100<100
    B: 100<99 / 100<101 / 100
    C: 99 / 100<100<101 / 100

// Proposed answer:

To solve this problem programmatically, we can follow these steps:

Parse the Fractions and Numbers: The given options contain fractions and whole numbers. We need to parse these into a format that can be easily compared. For instance, “99 / 100” can be parsed and stored as a fraction or a floating-point number (0.99 in this case).

Compare the Values: After parsing, we need to compare the values in each option to check if they are in ascending order. This involves comparing the first value with the second, and the second with the third in each option.

Determine the Correct Option: The option where the values are in correct ascending order (i.e., from smallest to largest) will be the correct answer.

Implement the Solution: We can implement this solution using a programming language like Python, which has built-in support for handling and comparing fractions.

Now, let's implement this plan. (python runs)
The correct option, where the numbers are in ascending order from smallest to largest, is Option A: 99/100 < 101/100 < 100.

GPT-4 solving (with new technique) and grading (until max_tokens)

My Solution:

First, let’s simplify the fractions and convert the whole numbers to fractions with a denominator of 100 for easy comparison:

A: 99/100 < 101/100 < 10000/100
B: 10000/100 < 99/100 < 101/100
C: 99/100 < 10000/100 < 101/100

Now, let’s compare the values in each option:

A: 99/100 < 101/100 < 10000/100 (True)
B: 10000/100 < 99/100 < 101/100 (False)
C: 99/100 < 10000/100 < 101/100 (False)

The correct option, where the numbers are in ascending order from smallest to largest, is Option A: 99/100 < 101/100 < 100.

Analysis of the Proposed Answer:

The proposed answer correctly identified the steps to solve the problem and provided a clear explanation of the process. However, the implementation of the solution was not provided as mentioned in the plan. The final answer is correct, and the technique employed is appropriate for this type of problem. The only improvement would be to include the actual implementation in

so:

  1. a smart AI able to cope with new things,
  2. orientation towards problem-solving and expecting challenges via the prompt,
  3. lots of encouragement to write its own context to reflect on via its output.

2 Likes

In summary: take the AI output, give it back to the AI as human input, and ask it to evaluate it. :thinking:
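In sketch form (assuming the OpenAI Python SDK, with a grading instruction paraphrased from the post above):

```python
from openai import OpenAI

client = OpenAI()
QUESTION = "..."  # the multiple-choice question from the original post

# Pass 1: let the model answer on its own.
answer = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": QUESTION}],
    temperature=0,
).choices[0].message.content

# Pass 2: hand that answer back as human input and ask the model to grade it.
grader_system = (
    "You grade the homework performed by another AI. "
    "Solve the problem yourself step by step, then assess the proposed answer."
)
review = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": grader_system},
        {"role": "user", "content": f"// Question:\n{QUESTION}\n\n// Proposed answer:\n{answer}"},
    ],
    temperature=0,
).choices[0].message.content

print(review)
```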

I don’t know what they did so that the model prefers to be confidently wrong rather than reevaluate its own position.

But I’m thinking it should theoretically be possible to alter that behavior through fine-tuning?

1 Like

Thanks for all replies.
But my topic is not a request to solve this specific problem; it is a request for insight.

I am also interested in knowing why GPT-4 gets certain questions wrong, but the real challenge is this:
How should we work to identify the limits of AI, and how should humans use AI appropriately (together with critical thinking)?

If a problem has to be reworked before an AI can solve it, then a conventional computer or calculator, not AI, could have solved it correctly in the first place.


Here is a concern expressed by Noriko Arai, director of the Center for Social Common Knowledge at the National Institute of Informatics and project leader of the "Can a robot get into the University of Tokyo?" project, in a TED talk given seven years ago (she talks about older systems such as the one that played Jeopardy!, but the fundamentals have not changed significantly).


It also means that we need to use AI appropriately, bearing in mind that most people cannot rework every question just so that the AI will choose the right answer.

GPT-4 is indeed a fantastic technology with a lot of potential, and that’s why we need to be careful with it.

As a person who combines moderate optimism and realistic caution about the future of AI, I would like to hear constructive insights and opinions!


Of course, I do not want to make a fuss. I just want some calm insights and opinions.