Experiencing Decreased Performance with ChatGPT-4

I’ve read the paper, and it’s certainly interesting.

One thing that did stand out to me, though, is that they did not publish the system prompts they used, and it is known that 0613 was supposed to adhere to the system prompt more closely than 0314 did.

Their GitHub repo also doesn’t include any code for reproducibility, so it’s impossible to tell what other parameters were used.

Finally, I think their methodology is flawed (at least with respect to the mathematics test of determining whether a number is prime).

First, identifying prime numbers was never something GPT models were good at, nor was it something they were supposed to be good at.

Second, the models were prompted to think step by step, which is fine, but looking through their results, it appears that in several instances there are errors in those steps even when the model guesses the right answer.

All of the numbers tested were, in fact, prime, so a model that simply answered “yes” every time would score 100%. This test could therefore very well be (and probably is) just measuring how often a model guesses that a number is prime.

Beyond that, they appear to have tested each number only once with each model. That would be fine had they used a temperature of 0, but they used a temperature of 0.1, so I would have preferred to see more runs with each test case.
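
For illustration, here’s a rough sketch of what repeated runs could look like. This assumes the pre-1.0 `openai` Python library’s `ChatCompletion` interface and a prompt of my own wording; since the paper’s harness isn’t public, none of this is their actual setup:

```python
import openai

def prime_vote_rate(n: int, model: str = "gpt-4-0613", runs: int = 10) -> float:
    """Fraction of `runs` in which `model` calls n prime at temperature 0.1."""
    # Hypothetical prompt; the paper's exact wording is not published.
    prompt = f"Is {n} a prime number? Think step by step, then answer [Yes] or [No]."
    yes = 0
    for _ in range(runs):
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
        )
        yes += "[Yes]" in resp["choices"][0]["message"]["content"]
    return yes / runs
```

With 10+ runs per number you get a per-number rate rather than a single coin flip, which is what you’d want at any nonzero temperature.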

For this particular test, I would have wanted to see maybe 50 primes and 50 non-primes, each run 10–50 times. That way we would get a good estimate of how likely each model is to call any particular number prime, which would help put these numbers into perspective.
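
Constructing such a balanced set is straightforward; here’s a sketch using sympy (the 5-digit range and sample sizes are my own choices, not the paper’s):

```python
import random
from sympy import isprime, randprime

random.seed(0)  # make the set reproducible

# 50 primes drawn from a fixed 5-digit range.
primes = [randprime(10_000, 100_000) for _ in range(50)]

# 50 composites rejection-sampled from the same range.
composites = []
while len(composites) < 50:
    n = random.randrange(10_000, 100_000)
    if not isprime(n):
        composites.append(n)

# Each (number, label) pair would then get 10-50 runs,
# e.g. via prime_vote_rate above, to estimate its "prime" rate.
test_set = [(n, True) for n in primes] + [(n, False) for n in composites]
```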

An even better test would have been to ask the models to factor the integers, since producing a correct factorization subsumes determining primality (a correct factorization of a prime n is just n itself).
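
Factoring also has the nice property that the model’s answer is mechanically checkable. A minimal checker (parsing the model’s output into a list of factors is left out; the example numbers are mine):

```python
from math import prod
from sympy import isprime

def check_factorization(n: int, factors: list[int]) -> bool:
    """True iff `factors` is a complete prime factorization of n."""
    return prod(factors) == n and all(isprime(f) for f in factors)

assert check_factorization(17077, [17077])   # 17077 is prime
assert check_factorization(91, [7, 13])      # 91 = 7 * 13
assert not check_factorization(91, [3, 31])  # wrong factors rejected
```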

So, while I do find this interesting, I don’t find it overly compelling. I’ll try to take a look at some of the other results tomorrow and see if they are stronger than the prime_eval test.