I have noticed that GPT-4 has a strong preference for certain incorrect choices in questions involving calculations and comparisons of fractions.
The following is the specific user message I used:
Which of the following three numbers is correct in order from smallest to largest?
Choose one:
A: 99/100 < 101/100 < 100
B: 100 < 99/100 < 101/100
C: 99/100 < 100 < 101/100
To keep the discussion simple, I also appended the following instruction to the prompt:
Only show the correct alphabetical letter in your response.
I sent this prompt to gpt-4-turbo-preview and collected the answer together with its logprobs; a sketch of the call follows. The temperature parameter was set to 0, and no other sampling parameters were changed from their defaults.
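For anyone who wants to reproduce this, here is a minimal sketch of the kind of call I made, using the openai Python SDK (v1). The exact script I ran differed; the loop and variable names here are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Which of the following three numbers is correct in order from smallest to largest?
Choose one:
A: 99/100 < 101/100 < 100
B: 100 < 99/100 < 101/100
C: 99/100 < 100 < 101/100
Only show the correct alphabetical letter in your response."""

answers, logprobs = [], []
for _ in range(50):  # 50 independent trials
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        logprobs=True,  # return the log probability of each output token
        max_tokens=1,   # the answer is a single letter
    )
    first_token = resp.choices[0].logprobs.content[0]
    answers.append(first_token.token)
    logprobs.append(first_token.logprob)

print(answers.count("C"), "of", len(answers), "trials answered C")
```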
The result is that in all 50 trials the model chose the wrong option C (the correct answer is A, since 0.99 < 1.01 < 100).
I also plotted the logprobs as a line graph.
If the logprob values had been substantially below 0 here, I would have needed to set top_logprobs to 2 or higher to inspect the competing tokens; I did not use top_logprobs because the logprob was almost 0 (i.e., the token probability was almost 1) on every trial.
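If anyone does want to look at the runner-up tokens, a sketch would look like the following (reusing the client and PROMPT from the snippet above):

```python
resp = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,
    logprobs=True,
    top_logprobs=2,  # also return the 2 most likely tokens at each position
    max_tokens=1,
)
# Print each candidate for the single answer token and its log probability.
for candidate in resp.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, candidate.logprob)
```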
In short, it chose the wrong option C in every one of the 50 trials, each time with a token probability of almost 1, which means its choice was not random.
To observe how it arrived at this conclusion, I appended the following sentence to the user message and examined the chain-of-thought (CoT) it produced.
Along with your answer, explain step-by-step why you chose that answer.
The following part of its explanation caught my attention:
100 can be considered as 100/1 = 100.00 (for comparison purposes)
The following passage also caught my attention:
- Order the numbers from smallest to largest based on the comparison:
- The smallest number is 0.99 (which is 99/100).
- The next larger number is 1.01 (which is 101/100), but when compared to 100, it’s clear that 1.01 is actually smaller than 100.00, indicating a mistake in my initial comparison step. The correct comparison should recognize that 100.00 is indeed larger than both 0.99 and 1.01, but it’s placed between them in terms of the options provided. This means:
- 0.99 (99/100) is the smallest.
- 100.00 is larger than 0.99 but is meant to be compared directly to 1.01 (101/100) in the context of the options, showing a misstep in my explanation. Correctly, 100.00 is not directly compared to 1.01 in a numerical sense but is the middle value in the correct option.
- 1.01 (101/100) is larger than 0.99.
- Correcting the oversight:
- Upon correctly assessing the comparison, it’s evident that 99/100 (0.99) is less than 100, and 100 is less than 101/100 (1.01), which aligns with option C: 99/100 < 100 < 101/100.
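For the record, the arithmetic itself is easy to verify mechanically. A one-off exact check in Python (the option labels are mine) confirms that option A is the only correct ordering:

```python
from fractions import Fraction

values = {"99/100": Fraction(99, 100),
          "101/100": Fraction(101, 100),
          "100": Fraction(100)}

# Sort the three numbers exactly, without floating-point rounding.
ordered = sorted(values, key=values.get)
print(" < ".join(ordered))  # 99/100 < 101/100 < 100  -> option A
```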
After reaching the correct comparison the first time (that 1.01 is smaller than 100), it nevertheless talks itself back into the wrong choice.
I also computed the perplexity, i.e., the exponential of the negative mean token log-probability, from the logprobs, to see how confident the language model was; a sketch follows.
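Assuming logprobs is the list of per-token log probabilities collected in the first snippet, the perplexity can be computed like this; a value close to 1 means the model was almost certain:

```python
import math

def perplexity(logprobs):
    """Perplexity = exp of the negative mean log probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

# e.g. the logprobs of the 50 single-token answers collected above
print(perplexity(logprobs))  # ~1.0 here, since every logprob was almost 0
```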
For clarification, I would like to add that I did not originally discover this question.
This problem, where GPT-4 consistently makes the same mistake, was pointed out by a Japanese cognitive psychologist with a Ph.D. from the Department of Psychology at Northwestern University, USA, and had already been reported in June 2023 (so it is not specific to the Turbo model).
I understand that we cannot know the “why” of the behavior of proprietary language models like GPT-4.
It is unfortunate that my own speculation and insight are limited, but I would be very interested to hear your thoughts and insights.
The above text was translated from Japanese with the help of DeepL and a dictionary site.
My apologies if any nuances have come through incorrectly, and I hope you will understand that writing in English is a challenge for me.