Experiencing Decreased Performance with ChatGPT-4

I’ve read the paper, and it’s certainly interesting.

One thing that did stand out to me, though, is that they did not publish the system prompts they used, and it is known that 0613 was supposed to adhere to the system prompt more closely than 0314 did.

Their GitHub repo also doesn’t include any code for reproducibility, so it’s impossible to tell what other parameters were used.

Finally, I think their methodology is flawed (at least with respect to the mathematics test of determining whether a number is prime).

First, identifying prime numbers was never something GPT models were good at, nor was it something they were supposed to be good at.

Second, the models were prompted to think step by step, which is fine, but looking through their results, it appears that in several instances there are errors in those steps even when the model guesses the right answer.

All of the numbers tested were, in fact, prime, so a model that simply answered “yes” every time would score 100%. This test could therefore very well be (and probably is) just measuring how often a model guesses that a number is prime.

Beyond that, they appear to have tested each number only once with each model. That would be fine had they used a temperature of 0, but they used a temperature of 0.1, so I would have preferred to see more runs with each test case.
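
For illustration, here’s a rough sketch of what repeated runs could look like. This assumes the pre-1.0 `openai` Python library’s `ChatCompletion` interface and a prompt of my own wording; since the paper’s harness isn’t public, none of this is their actual setup:

```python
import openai

def prime_vote_rate(n: int, model: str = "gpt-4-0613", runs: int = 10) -> float:
    """Fraction of `runs` in which `model` calls n prime at temperature 0.1."""
    # Hypothetical prompt; the paper's exact wording is not published.
    prompt = f"Is {n} a prime number? Think step by step, then answer [Yes] or [No]."
    yes = 0
    for _ in range(runs):
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
        )
        yes += "[Yes]" in resp["choices"][0]["message"]["content"]
    return yes / runs
```

With 10+ runs per number you get a per-number rate rather than a single coin flip, which is what you’d want at any nonzero temperature.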

For this particular test, I would have wanted to see maybe 50 primes and 50 non-primes, each run 10–50 times. That way we would get a good estimate of how likely each model is to call any particular number prime, which would help put these numbers into perspective.
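
Constructing such a balanced set is straightforward; here’s a sketch using sympy (the 5-digit range and sample sizes are my own choices, not the paper’s):

```python
import random
from sympy import isprime, randprime

random.seed(0)  # make the set reproducible

# 50 primes drawn from a fixed 5-digit range.
primes = [randprime(10_000, 100_000) for _ in range(50)]

# 50 composites rejection-sampled from the same range.
composites = []
while len(composites) < 50:
    n = random.randrange(10_000, 100_000)
    if not isprime(n):
        composites.append(n)

# Each (number, label) pair would then get 10-50 runs,
# e.g. via prime_vote_rate above, to estimate its "prime" rate.
test_set = [(n, True) for n in primes] + [(n, False) for n in composites]
```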

An even better test would have been to ask the models to factor the integers, since producing a correct factorization subsumes determining primality (a correct factorization of a prime n is just n itself).
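
Factoring also has the nice property that the model’s answer is mechanically checkable. A minimal checker (parsing the model’s output into a list of factors is left out; the example numbers are mine):

```python
from math import prod
from sympy import isprime

def check_factorization(n: int, factors: list[int]) -> bool:
    """True iff `factors` is a complete prime factorization of n."""
    return prod(factors) == n and all(isprime(f) for f in factors)

assert check_factorization(17077, [17077])   # 17077 is prime
assert check_factorization(91, [7, 13])      # 91 = 7 * 13
assert not check_factorization(91, [3, 31])  # wrong factors rejected
```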

So, while I do find this interesting, I don’t find it overly compelling. I’ll try to take a look at some of the other results tomorrow and see if they are stronger than the prime_eval test.