I have noticed that GPT-4 has a strong preference for certain incorrect choices in questions involving calculations and comparisons of fractions.

The following is a specific user message.

Which of the following three numbers is correct in order from smallest to largest?

Choose one:

A: 99 / 100<101 / 100<100

B: 100<99 / 100<101 / 100

C: 99 / 100<100<101 / 100
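For readers who want to double-check the arithmetic, a quick sanity check with Python's exact fractions confirms that A is the correct ordering and C is not:

```python
from fractions import Fraction

# The three values from the question, as exact fractions.
a = Fraction(99, 100)   # 0.99
b = Fraction(101, 100)  # 1.01
c = Fraction(100)       # 100

# Option A claims 99/100 < 101/100 < 100 -- this is the true ordering.
print(a < b < c)  # True

# Option C claims 99/100 < 100 < 101/100 -- false, since 101/100 < 100.
print(a < c < b)  # False
```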

To avoid complicating the discussion, I added the following instruction to the message:

Only show the correct alphabetical letter in your response.

I used gpt-4-turbo-preview and collected the responses along with their logprobs.

The temperature parameter was set to 0, and no other hyperparameters were changed.

The result: across 50 trials, the model chose the wrong option C all 50 times.

I also plotted the logprobs as a line graph.

If the logprob values had been far below 0, I would have needed to set top_logprobs to 2 or higher, but I did not use top_logprobs here because the logprob of the chosen token was almost 0 on every trial.
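For reference, this is a minimal sketch of how such a request can be configured. I am only building the parameter dictionary here (sending it requires an API key); the parameter names follow the Chat Completions API of the official openai Python client, and the question text is abbreviated:

```python
# Sketch of the request configuration described above.
# The "..." stands in for the full multiple-choice question from the post.
params = {
    "model": "gpt-4-turbo-preview",
    "temperature": 0,      # greedy decoding, as in the experiment
    "logprobs": True,      # return the log probability of each output token
    # "top_logprobs": 2,   # would additionally return runner-up tokens
    "messages": [{"role": "user", "content": "..."}],
}

# With the official client, this would be sent as:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**params)
print(params["model"])  # gpt-4-turbo-preview
```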


It chose the wrong option C in all 50 of 50 trials, which means its choice was not random.

To observe how it reaches this conclusion, I added the following to the user message and examined the chain-of-thought (CoT) process.

Along with your answer, explain step-by-step why you chose that answer.

Then the following part of its explanation caught my attention.

100 can be considered as 100/1 = 100.00 (for comparison purposes)

The following explanation also caught my attention.

Order the numbers from smallest to largest based on the comparison:

- The smallest number is 0.99 (which is 99/100).
- The next larger number is 1.01 (which is 101/100), but when compared to 100, it’s clear that 1.01 is actually smaller than 100.00, indicating a mistake in my initial comparison step. The correct comparison should recognize that 100.00 is indeed larger than both 0.99 and 1.01, but it’s placed between them in terms of the options provided. This means:

- 0.99 (99/100) is the smallest.
- 100.00 is larger than 0.99 but is meant to be compared directly to 1.01 (101/100) in the context of the options, showing a misstep in my explanation. Correctly, 100.00 is not directly compared to 1.01 in a numerical sense but is the middle value in the correct option.
- 1.01 (101/100) is larger than 0.99.

Correcting the oversight:

- Upon correctly assessing the comparison, it’s evident that 99/100 (0.99) is less than 100, and 100 is less than 101/100 (1.01), which aligns with option C: 99/100 < 100 < 101/100.

Even after reaching the correct comparison partway through, it still dares to pick the wrong choice in the end.

I also used the logprobs to compute the perplexity (the exponential of the negative mean token log probability) to see how confident the language model is.
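As a concrete illustration (with made-up logprob values, not the actual ones from my runs), perplexity can be computed from token logprobs like this:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical logprobs: values near 0 mean near-certain tokens,
# so perplexity is close to 1 (maximum confidence).
confident = [-0.0001, -0.0002, -0.0003]
print(round(perplexity(confident), 4))  # 1.0002

# Lower (more negative) logprobs give a higher perplexity.
less_confident = [-1.2, -0.8, -2.0]
print(round(perplexity(less_confident), 3))
```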

For clarification, I would like to add that this question was not originally discovered by me.

This problem, where GPT-4 consistently makes this mistake, was pointed out by a Japanese cognitive psychologist with a Ph.D. from the Department of Psychology at Northwestern University, USA, and had already been reported in June 2023 (not specific to the Turbo model).

I understand that we cannot know the “why” of the behavior of proprietary language models like GPT-4.

It’s unfortunate that my speculations and insights are limited, but I would be very interested to hear your thoughts and insights.

The above text was translated from Japanese with the help of DeepL and a dictionary site.

So my apologies if any nuances or other details are not quite right, and I hope you will understand that writing in English is a challenge for me.