There are many fundamental concerns with this paper.
Here is one example, demonstrating that GPT-4 actually improved from March to June on the LeetCode code-generation task:
| LeetCode accepted | june_fixed | june_orig | march_orig |
|---|---|---|---|
| True | 35 | 5 | 26 |
| False | 15 | 45 | 24 |

(Per the linked issue, `june_orig` and `march_orig` are the paper's original scoring; `june_fixed` re-scores the June responses after stripping the markdown code fences that the paper's scorer counted as "not directly executable".)
Source: Deceptive definition of "directly executable" code · Issue #3 · lchen001/LLMDrift · GitHub
I've already written about my concerns with the methodology of their test of mathematics ability (evaluating whether a number is prime or not), which I will attempt to summarize here:
- GPT-3.5 and GPT-4 are large language models. While math is a language, and these models have shown emergent capabilities in the field of mathematics, evaluating whether a number is prime is not a good test of mathematics ability; I would be hesitant to say it is a test of mathematical reasoning at all.
- They tested using only prime numbers. The problem is that we cannot discern whether the models have lost (or gained) reasoning ability, or whether the models simply have a bias toward answering "yes" or "no" to the question "Is [number] a prime number?" If they had included composite numbers in their tests, we would have a much clearer picture of what is happening, because we could compare the proportion of composite numbers the models identify as prime with the proportion of prime numbers they identify as prime (see the first sketch after this list).
- They used a temperature of 0.1. There is nothing inherently wrong with choosing this temperature, but it:
a. Does not represent the expected behaviour of ChatGPT, where the temperature is 1.0.
b. Suggests they should have run more than one replication per number to account for the variance of the model. They could then set a threshold, say 75%, at which the model is considered to have correctly answered the question. E.g. run each number 20 times; if the model gets the correct answer 15 times or more, it gets credit for that question (see the second sketch below).
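
To make the composite-number point concrete, here is a minimal sketch of why a primes-only test can't separate real discrimination from a yes-bias. The `biased_model` function is a hypothetical stand-in that answers "yes" 90% of the time regardless of input; you would swap in a real API call to run the test for real:

```python
import random

from sympy import isprime

rng = random.Random(0)

def biased_model(n: int) -> bool:
    """Hypothetical stand-in for the LLM: answers "yes" with
    probability 0.9, ignoring n entirely."""
    return rng.random() < 0.9

def balanced_sample(lo: int, hi: int, k: int):
    """Draw k primes and k composites uniformly from [lo, hi]."""
    primes, composites = [], []
    while len(primes) < k or len(composites) < k:
        n = rng.randint(lo, hi)
        if isprime(n):
            if len(primes) < k:
                primes.append(n)
        elif len(composites) < k:
            composites.append(n)
    return primes, composites

primes, composites = balanced_sample(10_000, 99_999, k=500)

# Primes-only evaluation (what the paper did): looks like ~90% accuracy.
acc_primes_only = sum(biased_model(n) for n in primes) / len(primes)

# Balanced evaluation: the same model also calls ~90% of composites
# prime, revealing that it is not discriminating at all.
yes_on_composites = sum(biased_model(n) for n in composites) / len(composites)

print(f"primes-only 'accuracy':   {acc_primes_only:.1%}")
print(f"'yes' rate on composites: {yes_on_composites:.1%}")
```

A model that always answers "yes" scores 100% on the paper's test while having zero ability to identify primes; only the composite half of a balanced set can tell the two apart.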
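
And point (b) amounts to something like the following sketch, where `ask` would be the real API wrapper (or the simulated model above):

```python
def majority_correct(ask, n: int, truth: bool,
                     reps: int = 20, threshold: float = 0.75) -> bool:
    """Query the model `reps` times for the same number and credit it
    only if at least `threshold` of the answers match the ground truth."""
    hits = sum(ask(n) == truth for _ in range(reps))
    return hits / reps >= threshold

# e.g. majority_correct(biased_model, 10007, truth=True)
# requires 15 of 20 runs to answer "yes" before 10007 counts as correct.
```

At temperature 0.1 the variance is small but not zero, so a single sample per number conflates the model's answer distribution with its accuracy.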
Now, I haven't yet had time to dig through the rest of the paper, but if these issues are immediately apparent in a two-minute read-through, I suspect there are others as well.
It should also be noted that this is a draft paper; it has not been peer-reviewed. I do not imagine it would be published in any worthwhile journal in its current state, and I am doubtful the issues could be corrected, especially since the March model is no longer publicly available.