GPT-4 is getting worse over time, not better.

1/ Many people have reported noticing a significant degradation in the quality of the model responses, but so far, it was all anecdotal.

2/ But now we know. At least one study shows that the June version of GPT-4 is objectively worse than the version released in March on a few tasks.

3/ The team evaluated the models using a dataset of 500 problems where the models had to figure out whether a given integer was prime. In March, GPT-4 answered 488 of these questions correctly. In June, it got only 12 right.

4/ From 97.6% success rate down to 2.4%! :scream:

5/ But it gets worse! The team used Chain-of-Thought to help the model reason: “Is 17077 a prime number? Think step by step.”

6/ Chain-of-Thought is a popular technique that significantly improves answers. Unfortunately, the latest version of GPT-4 did not generate intermediate steps and instead answered incorrectly with a simple “No.”

7/ Code generation has also gotten worse. The team built a dataset with 50 easy problems from LeetCode and measured how many GPT-4 answers ran without any changes.

8/ The March version succeeded on 52% of the problems, but this dropped to a paltry 10% using the model from June.

9/ Why is this happening? We assume that OpenAI pushes changes continuously, but we don’t know how the process works and how they evaluate whether the models are improving or regressing.

10/ Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run. When a user asks a question, the system decides which model to send the query to.

11/ Cheaper and faster, but could this new approach be the problem behind the degradation in quality?

12/ In my opinion, this is a red flag for anyone building applications that rely on GPT-4. Having the behavior of an LLM change over time is not acceptable.

13/ Have you noticed any issues when using GPT-4 and ChatGPT lately? Do you think these problems are overblown?
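
For anyone who wants to sanity-check the numbers in 3/ and 4/, here is a minimal sketch of how a yes/no primality eval could be scored. It is not the study's actual harness: `is_prime`, `score`, `success_rate`, and the "reply starts with yes" parsing rule are all assumptions made for illustration.

```python
def is_prime(n: int) -> bool:
    """Ground truth by trial division; fine for small integers like 17077."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def score(replies: dict) -> float:
    """Fraction of replies whose yes/no verdict matches the ground truth.

    `replies` maps each integer to the model's raw text reply.
    Assumption: a reply counts as "yes" only if it starts with "yes".
    """
    correct = sum(
        1
        for n, reply in replies.items()
        if reply.strip().lower().startswith("yes") == is_prime(n)
    )
    return correct / len(replies)


def success_rate(correct: int, total: int) -> float:
    """Success rate as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)
```

Plugging in the thread's counts, `success_rate(488, 500)` gives 97.6 and `success_rate(12, 500)` gives 2.4. A bare "No." for 17077, which is prime, scores as wrong under this rule.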

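The "ran without any changes" measurement in 7/ and 8/ can be approximated along these lines. This is a rough sketch only: the study's real scoring may well differ, the names `runs_unchanged` and `executable_rate` are made up here, and `exec` on model output is not sandboxed, so only try it on code you trust.

```python
def runs_unchanged(answer: str) -> bool:
    """True if the raw model output compiles and executes as Python as-is."""
    try:
        # Fresh namespace so one answer cannot see names defined by another.
        exec(compile(answer, "<model-answer>", "exec"), {})
        return True
    except Exception:
        # SyntaxError, NameError, etc. all mean "needs changes before it runs".
        return False


def executable_rate(answers: list) -> float:
    """Fraction of answers that run without any edits."""
    return sum(runs_unchanged(a) for a in answers) / len(answers)


# Hypothetical answers: the same function, once as plain code and once
# wrapped in a markdown fence, which makes the raw text invalid Python.
FENCE = "`" * 3
clean = "def add(a, b):\n    return a + b\n"
wrapped = FENCE + "python\n" + clean + FENCE + "\n"

rate = executable_rate([clean, wrapped])
```

Note that this only checks whether the text loads and runs, not whether the solution is correct. A stricter eval would also run the problem's test cases against each answer.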

To be fair, I really don’t care about GPT’s ability to solve problems. Solving has never been the issue, not since the early days of programming.

Can it create the problem? Yes, 100%. Much better than before.

Regardless, I think enough people are noticing an issue, myself included. I have noticed some degradation in GPT’s quality. I still believe it’s much better than the previous version, but I have still noticed issues.

Of course, the Mixture of Experts. It’s definitely easier to have a classifier model determine which expert can answer the question appropriately than to build an AGI. But I always thought OpenAI was a research company, not one trying to appeal to the masses. I can understand that money is money, but it’s not like we aren’t willing to pay the difference. A common theme with OpenAI is that, like its own model, it is a black box. I’ve given up trying to follow and understand, and am just enjoying the ride instead.

But, of course, academics will say “show the proof” and “submit an eval”. Gosh, when I first started using GPT I was very excited about the concept of a “conversational agent”, but it has drifted into being a “reasoning engine”. For some reason there’s an infatuation with its ability to solve. Why? Why solve? Even humans tend to use a calculator to solve; our strength is at the other end.

And what about evals? Single conversation pairs? Measuring how it responds to a single question? Boring. I want to see how it can manage and continue a conversation. Maximize its purpose. Attention is its strong suit. It’s not a simplistic function.

My lord. A thousand people can scream “It looks like this building is going to collapse!!” Or a thousand people can say “This structure is going to fail!” Yet the response is “Nope, calculations were made, our tests say that you’re wrong, and we won’t listen until you can prove us wrong.” In all their glory. Until the unthinkable happens.

Still, I do prefer the GPT of today to the one before. But only because I’ve adapted to its purpose and find more $$$$$ value.

