Declining Quality of OpenAI Models Over Time: A Concerning Trend

I’ve noticed a troubling pattern with OpenAI’s release strategy for their language models, particularly GPT-4 and now GPT-4o. It seems they initially release a high-quality product to attract subscribers, but over time, the model’s performance degrades noticeably.

Key observations:

  1. Initial release: The model performs exceptionally well, generating excitement and subscriptions.
  2. Gradual decline: Over weeks or months, users report slower response times, more frequent interruptions mid-response, and overall decreased quality of outputs.
  3. Inconsistent performance: The model becomes less reliable, occasionally “hanging” or producing incomplete responses.
  4. No improvement over time: Instead of refining and enhancing the model, the quality seems to worsen.

This pattern was evident with GPT-4 and appears to be repeating with GPT-4o. It raises questions about OpenAI’s commitment to maintaining service quality for paying subscribers.

As a user, it’s frustrating to pay for a subscription only to see the service deteriorate rather than improve. This approach feels like a bait-and-switch tactic: lure customers in with a premium product, then gradually reduce its quality while still charging the same price.

Has anyone else noticed this trend? What are your thoughts on OpenAI’s approach to model releases and maintenance? Is it still worth subscribing given this pattern of declining quality?

I’m interested in hearing others’ experiences and opinions on this matter. Perhaps there are technical explanations for these issues that I’m not aware of, but from a user perspective, it’s a concerning trend.

3 Likes

I would be curious to see any evidence you’ve collected to substantiate these claims.

If you have benchmarks or graphs, those would also be nice to see.

I fully agree with @pawel.patrzek here.

What metrics would satisfy you as far as evidence goes? And how exactly would you benchmark a steadily declining quality of responses? It’s not like ChatGPT users have access to logs or data quality reports from OpenAI. Only they can provide this info. But that doesn’t mean you should dismiss so many users reporting the same problem.

I had a test thread here about an audio project to determine what the responses would be like. Terrible quality responses. It lost context about 3 messages in. Does not track memory contents at all, and instructions are completely ignored. This is true across all threads, regardless of the topic. The clarity and detail in the instructions seem to make no difference at all.

1 Like

Based on the change in behavior, I am speculating that changes were introduced to prioritize response time over quality, which would explain why 4o responds very quickly. But if the context and instructions are being ignored in order to provide fast responses, then the tool is not usable.

When I first started using 4o for help with coding, it was like sitting down with a much more skilled human than myself. It was tracking changes across multiple files, recommending changes across an entire project, it was keeping great awareness of a large context, and it was following instructions.

All of that is gone now.

1 Like

If you constantly get errors several times in a row, with code generation cut off in the middle of the process, and you wait several minutes for an answer because it gets stuck, what benchmark do you need to understand that the service is getting worse?

My patience finally ran out today when ChatGPT interrupted code generation seven times in a row. Each time, I had to click “regenerate” and hope for a complete response. This process was so frustrating and time-consuming that I could have written the code faster myself.

The whole point of subscribing to this service is to increase productivity and speed up my workflow. Instead, I found myself wasting time, getting increasingly frustrated, and ultimately working slower than if I hadn’t used the tool at all.

While I understand your frustration, to demonstrate declining quality you need to show some type of objective quality measure at a baseline point and then show lower quality on that same measure at a later point in time.

Like, if you had some type of benchmark you had run 100 times a month ago and you ran the same benchmark 100 times today, you’d be able to determine the mean and variance of the benchmark scores for the model at each point in time. With that you’d be able to determine if there exists a statistically significant difference in the two models with respect to that particular benchmark and with what degree of certainty you are making that claim.
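
As a rough sketch of what that comparison could look like (assuming you have already scored each of the 100 runs numerically, e.g. pass/fail or a rubric; the file names and scoring are placeholders, not anything OpenAI provides):

```python
# Welch's t-test on two batches of benchmark scores, e.g. 100 runs a month apart.
# Assumes each run has been scored numerically; the JSON file names are hypothetical.
import json
from statistics import mean, variance

from scipy.stats import ttest_ind

with open("scores_month_ago.json") as f:   # e.g. [1, 0, 1, 1, ...]
    old_scores = json.load(f)
with open("scores_today.json") as f:
    new_scores = json.load(f)

print(f"old: mean={mean(old_scores):.3f}, variance={variance(old_scores):.3f}")
print(f"new: mean={mean(new_scores):.3f}, variance={variance(new_scores):.3f}")

# Welch's t-test does not assume equal variances; a small p-value means the
# difference in means is unlikely to be chance alone.
t_stat, p_value = ttest_ind(old_scores, new_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```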

My only benchmark is that I see this kind of picture more and more often.

Previously I saw this from time to time, but now I have to create a new chat window and start the conversation from scratch. After your reply I tried to regenerate the answer. In the network tab there is no error.

But there is also still no answer. So maybe the problem is not with the model itself but with the application. Either way, I treat the service as a whole.

I am sorry man, but I am a paying user, not an analyst to help OpenAI troubleshoot their service and track metrics to improve service quality.

I understand what you are saying and the need for actual benchmarks, but if you can show me how these benchmarks can be created, I would be happy to make that happen, although I have stuff to do and no patience for this.

What I know is that about 2 months or 6 weeks ago this was a kick-ass product I could rely on. Today, it is unusable in the ways in which I was depending on it before. It’s unlikely I am imagining this, and I have dozens of threads as evidence. The only unseen data is what’s inside the OpenAI black box of spaghetti that makes all this happen. I don’t know what internally determines the change in behavior of ChatGPT. They do. And it would be nice for them to be more…OPEN about how their AI works or doesn’t work, especially to the people who pay for it.

3 Likes

You can create benchmarks with a specific coding question or math question and see if the model can solve it consistently. There are many ways to chart this.
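
For example, something along these lines run on a schedule would give you a pass rate you can chart over time. This is a minimal sketch against the API, so it is a proxy for ChatGPT rather than the exact same thing; the question, the pass check, and the model name are just placeholders:

```python
# Repeatable spot-check: ask the same fixed question N times and log how many
# answers pass a simple check, one CSV row per day, so the rate can be charted.
import csv
from datetime import date

from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from env

client = OpenAI()
QUESTION = "Write a Python function is_prime(n) that returns True for primes."
N_RUNS = 20

def passes(answer: str) -> bool:
    # Deliberately naive criterion: the reply at least defines the requested function.
    return "def is_prime" in answer

passed = 0
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": QUESTION}],
    )
    if passes(resp.choices[0].message.content or ""):
        passed += 1

with open("benchmark_log.csv", "a", newline="") as f:
    csv.writer(f).writerow([date.today().isoformat(), passed, N_RUNS])

print(f"{passed}/{N_RUNS} runs passed")
```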

My entire point here is that nebulous complaints don’t really serve any purpose—they’re not actionable.

Complaints about “quality” aren’t helpful unless it can be shown the quality issues are real, persistent, and replicable.

The models are stochastic, and with more than 100 million users across the world using them, some percentage of users will invariably experience streaks in the quality of generations.

If you have 20 great generations in a row followed by 20 poor generations some time later, you might perceive that as a degradation in model quality. This is true from your subjective experience, but it may not be the objective truth. This is why I always respond to posts like this asking for evidence.
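
To make the streaks point concrete, here is a toy simulation under the assumption that every generation independently succeeds with the same fixed probability, i.e. no real degradation at all; runs of consecutive failures still appear for any single user, and across 100+ million users some people will inevitably see long ones:

```python
# Toy illustration: even with a constant 90% per-generation success rate
# (no degradation), a single heavy user still sees runs of bad generations.
# The 90% figure and the 1,000-generation sample are made-up assumptions.
import random

random.seed(0)
P_GOOD = 0.90
N_GENERATIONS = 1_000

longest_bad_streak = 0
current_streak = 0
for _ in range(N_GENERATIONS):
    if random.random() < P_GOOD:
        current_streak = 0
    else:
        current_streak += 1
        longest_bad_streak = max(longest_bad_streak, current_streak)

print(f"Longest run of consecutive bad generations: {longest_bad_streak}")
```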

OpenAI is constantly evaluating their model quality. A new version of a model may perform better in some evals and worse in others, but if the model performance is better in evals which are more reflective of how the majority of their users are using the models, that model is likely to be put into production even if it performs slightly worse in some evals which reflect less common use.

So, depending on what you specifically are doing, the model performance may genuinely be worse for you.

But, there’s also a third possibility. Your expectations change over time and your perception of the model’s quality may be colored by these changing expectations. This, again, is why it’s critical to have the same set of message exchanges evaluated at different times in order to remove the subjective human experience from the equation.

In my subjective experience, I’ve not experienced any degradation in quality, but I’m also not running benchmarks because this hasn’t been a concern of mine.

1 Like

OK, so please tell me: where is the problem?

On the 8th and 9th attempts I got errors again.

After the 8th attempt I changed my internet provider, because I wanted to be sure the problem is not on my side.

In the console it looks like the application has a problem calling some service.

And even if the problem were on the user’s side, after the default fetch timeout in Chrome (300 seconds) the application should either retry or tell the user that the connection was lost. When I get this kind of error, I rather think it is a problem with scaling the service during high demand. I started this topic 3 hours ago, and I still haven’t gotten a finished answer in the thread from the screenshots I’m showing you, because I keep getting various errors. Why don’t I get network errors from the free versions of Gemini or Claude?

Seriously, we aren’t here just because we like to complain. Before I decided to write this thread I waited a long time, but finally I decided to cancel my subscription. I’m not saying I will never come back, but if the service is just not working, it is useless to me. I’m not even talking about the quality of answers, because, like you said, the feeling of that may be more subjective than objective. But those errors and such slow responses are not worth paying for.

Hi, personally I wonder if it isn’t a “crowd” bias: long-term users and professionals know how to get more out of any model, and they adopt new models faster than regular users…

Then the crowd arrives, with a lower “mean” skill score but in much higher numbers.

And as more users who are not that skillful with the models start using the new model, you naturally get more unforeseen edge cases and more general complaints. They also might simply outnumber the positive reviews from skillful users.

This theory stays a theory until someone comes here with performance graphs adjusted for a big sample of users and their “skillfulness” scores…

Application errors are something very different than “declining quality of OpenAI models.”

For this particular issue I would suggest the following:

  1. Ensure you don’t have any extensions enabled which might be interfering with the service.
  2. Log out and log back in.
  3. Clear your browser cache.
  4. Try a different browser altogether.
  5. Check whether the service works in the mobile application.
  6. Try using a different network if possible.

1 Like

OK, then I should change the title to “declining quality of the OpenAI service”.

Believe me, man. Like I said, I waited a long time before I wrote this thread. I tried to debug this on my own. I read advice from people on the internet, from Reddit,

or even from this forum

But the only thing that helps in that situation is to start a new chat.

I understand that a service under high demand may run into problems. I understand that ChatGPT is getting more popular, and I wouldn’t be surprised if, with higher demand, OpenAI needs to serve a lighter version of the model or give less computing power to each user, so answers are not as good as they could be. But let’s leave that aside. The service is more and more often just working slower, and I get more errors. Maybe for users who only ask the chat simpler questions this is not such a problem.

1 Like

Yes. I especially noticed a difference over the last two days, compared to last weekend. I have received a new phrase over and over and over again yesterday and today: “I’m sorry, I can’t assist you with that.” I asked why and it can’t even answer. I have seen this two or three dozen times. I finally figured out that if I take my information and go to 3.5, then it can do it. Then if I take that information and go back to 4 or 4o, I can proceed. When I get the “I’m sorry, I can’t assist you with that” message again, I repeat the process and go back to 3.5. It is a very frustrating fix.

To clarify, my post is about using ChatGPT, not the API. Thank you for the link to this repo, but I am not willing to spend more money on tokens to troubleshoot OpenAI’s problems; however, I will look into it for cases where API calls are not working correctly.

Also, there is another thread where I posted screenshots showing the tool completely ignoring instructions. I would be happy to share more with people who think this is user crowd bias. I did not even realize how widespread this was when I made the original post, until I noticed others reporting the same problems independently of me.

1 Like

That’s a new one for me…I never got that answer before.

1 Like

I have been using OpenAI’s GPT since two weeks after it came out to the public, and I have been a paying user for over a year. I’m not new. I am only new to this forum, and that’s because I am frustrated with the changes.

1 Like

I am using the GPTs I made, but to use 3.5 I must ditch my GPTs. The only way I have found around this issue so far is to have the different versions interact with each other, with me as the intermediary.

1 Like