Perceived Drop in GPT-5 Quality Over the Last Few Weeks

This week has been filled with disappointment regarding GPT-5. I noticed a major change in the model’s behavior, especially in tasks that used to be simple to solve.

I do not have technical evidence or concrete proof to state this with certainty, but my user experience has changed a lot. The GPT-5 I am using now does not feel like the same model from launch.

I tried to complete the same task dozens of times. I adjusted the prompts several times, made the instructions clearer, rewrote the requests, explained the goal in different ways, and even then the model kept delivering almost the same result. In some cases, I asked it to redo the task more than five times, always reinforcing what needed to be corrected, but it ignored important parts of the instructions and repeated the same mistakes.

The behavior became very similar to what I have seen in other models: instead of following the request precisely, it tries to “work around” the problem, creates messy fixes, invents improvised solutions, generates unnecessary scripts, and when I ask for adjustments, the result often becomes even worse. The model feels less careful, less consistent, and less obedient to instructions.

I reached the point where I had to create a skill with detailed instructions, along with much more structured prompts, just to try to get a minimally acceptable result. Even then, I could not reach the same level of quality that I used to get before with a simple prompt and, at most, one or two adjustments.

My impression is that the current model may not be exactly the same as the launch version. It feels like a smaller, more limited variant, or like some kind of optimization has been applied that reduced the quality of the responses. I know this is only a personal assumption, but the difference in behavior is too significant to ignore.

One clear example is GPT-5’s reasoning mode. Before, it seemed to think for longer, analyze the problem better, look for sources when necessary, and deliver more consistent answers. Now, many times, the reasoning seems to last only two or three seconds before it starts answering. This behavior reminds me a lot of smaller models or models optimized to answer quickly, but with less depth.

The issue is not just one bad response. The issue is the repeated pattern: ignored instructions, poor self-correction, improvised solutions, loss of context, lack of depth, and difficulty delivering something that used to be relatively simple.

I would like to know whether other people have also noticed this change in GPT-5 over the last few weeks. For me, the current experience is far below the quality the model showed at launch.

There are several GPT-5 models. Which have you been using?

There is only one GPT-5 available in Codex. I am using it at maximum reasoning capacity.

Are you noticing this inside the repositories where the documentation recently grew from what it was before?

I’m asking because Codex might be producing useful but sometimes wordy documentation and instructions. These take up space in the context window, which may reduce the quality of the model’s output.

I noticed that personally in my repositories: where the documentation is important and needed but there is “too much” of it, overall quality can drop because of how the documentation spills into the context window.

When you start on lighter, fresher repositories, Codex works perfectly, especially the 5.5 model at extra-high reasoning.
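If anyone wants to check whether documentation bloat is a plausible factor in their own repository, here is a minimal sketch that estimates how many tokens the documentation files would take up. It assumes roughly 4 characters per token and treats `.md`, `.txt`, and `.rst` files as documentation; both are my assumptions, and the script has no knowledge of what Codex actually loads into its context.

```python
import os

# Assumption: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4
# Assumption: these extensions are what counts as "documentation" here.
DOC_EXTENSIONS = (".md", ".txt", ".rst")


def estimate_doc_tokens(repo_root, extensions=DOC_EXTENSIONS):
    """Return {relative_path: estimated_tokens} for documentation files."""
    estimates = {}
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Skip version-control internals.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
                rel = os.path.relpath(path, repo_root)
                estimates[rel] = len(text) // CHARS_PER_TOKEN
    return estimates


def report(estimates, top=10):
    """Build a short report listing the largest documentation files first."""
    ranked = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    lines = [f"{tokens:>8}  {path}" for path, tokens in ranked[:top]]
    lines.append(f"{sum(estimates.values()):>8}  TOTAL (rough estimate)")
    return "\n".join(lines)
```

Running `print(report(estimate_doc_tokens(".")))` from the repository root and comparing the total between an older and a newer checkout would at least show whether the documentation has grown substantially.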

Oh is it the app or Codex within ChatGPT? In my Codex CLI I can select 5.3, 5.4 or 5.5 and various levels of reasoning and two levels of speed …

The desktop application also has these options. I’m using GPT 5.5 with maximum reasoning capacity, and that’s where I’m having problems. The 5.4 model, while not as efficient, doesn’t ignore instructions.

Which GPT-5 model do you mean?
In the Codex desktop app I see 5.2, 5.3, 5.4 mini, 5.4, and 5.5, all with various reasoning and speed options, but no singular GPT-5.

I haven’t noticed any lower quality tbh. I have primarily been using 5.5, and 5.3 for some planning, and they have seemed fine. I know there were some token consumption issues a while ago, but that seems fine now. As far as quality goes, I haven’t seen a drop or downgrade.

Is there any other info or examples you can give that could help troubleshoot this apparent drop in quality? And can you also specify which GPT 5.x you mean?

I think I replied as you posted, so you answered my question already haha.

But ok, with the info you now gave. Hmmm… that is odd. I have not seen these issues at all with 5.5. I have been using med-high reasoning with that model and have not experienced any of those issues. I have had good success as well with 5.4 and have been trying to use that for planning more than 5.3 since that will be sunset soon enough.

I’m using model 5.5 with Maximum Reasoning, and I didn’t experience this issue with previous models.

The problem appears specifically when using version 5.5. When the context is relatively small, it works well. However, when asking it to create a longer document, such as a 30-page file, the output becomes problematic. It often includes repeated sections across multiple pages, duplicated content, and text that is meaningless or out of context.

This makes it difficult to rely on the model for long-form document generation, especially when consistency, structure, and coherence are required.
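To turn the "repeated sections" observation into something that can be measured and shared, a minimal sketch like the one below can flag paragraphs that appear more than once in a generated document. The 80-character threshold and the blank-line paragraph split are arbitrary assumptions on my part, and it only catches near-verbatim repeats, not paraphrased ones.

```python
import re
from collections import defaultdict


def normalize(paragraph):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def find_repeated_paragraphs(text, min_chars=80):
    """Return {normalized_paragraph: [positions]} for repeated paragraphs.

    Positions are 0-based paragraph indexes; paragraphs are split on blank
    lines. min_chars skips headings and other short lines.
    """
    seen = defaultdict(list)
    for index, paragraph in enumerate(re.split(r"\n\s*\n", text)):
        key = normalize(paragraph)
        if len(key) >= min_chars:
            seen[key].append(index)
    return {key: pos for key, pos in seen.items() if len(pos) > 1}


def duplication_ratio(text, min_chars=80):
    """Fraction of long paragraphs that repeat an earlier long paragraph."""
    paragraphs = [normalize(p) for p in re.split(r"\n\s*\n", text)]
    long_paragraphs = [p for p in paragraphs if len(p) >= min_chars]
    if not long_paragraphs:
        return 0.0
    return 1 - len(set(long_paragraphs)) / len(long_paragraphs)
```

Running this on the same prompt’s output from two model versions would give a number to attach to a report instead of an impression.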

That’s a good way to devour a load of tokens. Why do you need such high levels of reasoning? What kind of document are you creating, and can you build it in chapters instead?

A 30-page document is approximately 15,000 tokens, so I don’t believe this should be a problem for GPT-5.5, since the context window still has plenty of room available.

The issue does not appear to be caused by context length alone. Even with enough remaining context, the model starts producing repeated sections, duplicated pages, and content that loses coherence or becomes unrelated to the original request. This suggests a problem with long-form generation quality rather than a simple context-window limitation.
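For reference, the 15,000-token figure above can be reproduced with simple arithmetic. This is a rough sketch: the 375 words per page and 0.75 words per token are rule-of-thumb assumptions, not measured values, and the context window size in the usage note is a placeholder rather than a published number for any model.

```python
# Assumptions: ~375 words per page and ~0.75 words per token for English
# prose. Both are rules of thumb and vary with layout, language, and tokenizer.
WORDS_PER_PAGE = 375
WORDS_PER_TOKEN = 0.75


def estimate_tokens(pages, words_per_page=WORDS_PER_PAGE,
                    words_per_token=WORDS_PER_TOKEN):
    """Rough token count for a document with the given number of pages."""
    return round(pages * words_per_page / words_per_token)


def share_of_window(tokens, context_window):
    """Fraction of a context window that the document alone would occupy."""
    return tokens / context_window
```

With these assumptions, `estimate_tokens(30)` gives 15000, and `share_of_window(15000, 200_000)` gives 0.075 for a hypothetical 200,000-token window. That only counts the finished text; the prompt, any files read along the way, and reasoning output would come on top of it.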

You didn’t really answer my question.

Why do you need to use such high levels of reasoning?

Food for thought:

I need a very high level of precision and elaboration, which is why I use elevated reasoning. I tested lower reasoning settings, but the results were not good enough for my use case.

I have been using this model since launch, and I did not have these problems before. They started this week. It is noticeable even in the model’s reasoning time: before, it would think for longer, search through more sources, and the result was almost perfect. Usually, one or two adjustments were enough to fix anything.

Now, however, the quality feels significantly different. The model seems to reason for less time, ignores important instructions more often, and produces outputs with repetition, duplication, and content that is out of context. For long-form documents that require structure, coherence, and accuracy, this has become a serious issue.

This is why I like to make my own harnesses and chain of thought.

But that’s a faff of course.

It could be that OpenAI is fine-tuning these things on the fly to balance priorities for compute.

I’m not sure we will ever know. It’s a bit opaque.

But this is useful feedback for them …

I had the same thought. Of course, this may just be my own speculation, but it feels like the model’s reasoning capacity has been reduced, or that the reasoning time has been shortened to reduce server load.

I understand that we are paying for a plan that may not fully cover the real cost of running such advanced models, so I do not want to complain unfairly. However, the frustrating part is when something is launched with a certain level of quality and performance, people subscribe because of that experience, and then later the same quality seems to be reduced in order to cut costs.

That is what makes the situation disappointing. The model I am using now does not feel like the same model I used at launch. Before, it would reason more deeply, follow instructions more carefully, search and structure information better, and usually required only one or two adjustments. Now, especially for long-form and highly structured work, it often produces repetition, duplicated sections, and content that loses coherence.

We’ve experienced similar degradation over the past week or so. It is difficult to pinpoint exactly when it started, but we have been using GPT-5.5 xhigh almost exclusively since release across several firmware and software repositories.

Along with the drop in output quality, we have also noticed faster response times and significantly higher context usage across nearly every type of work. The outputs have become so unreliable that we have had to pull back on development for the time being. At this point, the business accounts feel like cash incinerators when used for this workflow, because we are discarding most of what is generated.

This is now the second time in a month I have posted about a major disruption. The first was related to the sharp reduction in usable business account allotments. Those accounts effectively became pay-per-token almost overnight. We understood there had been a temporary 2x boost, but the actual reduction felt closer to 5x to 10x, with very little transparency around the change.

Now we are dealing with degraded output, increased context usage, and faster generation of responses that we often have to throw away. That combination creates another serious business disruption.

It is one thing to absorb higher costs. It is another thing entirely when those higher costs are paired with unusable output.

Hopefully whatever caused this regression is addressed quickly, and the broader pattern of business disruptions and limited transparency improves. The earlier Codex models won us over, and GPT-5.5 initially looked very promising, but the current experience is not workable for serious development use.

I’m trying to work around this by using Gemini through the API, but the API cost is extremely high.

Gemini is very efficient for generating long documents, especially when the goal is to produce structured written content. However, I would not recommend it for coding tasks. In my experience, it often creates things outside the original request, such as unnecessary scripts, simulations, mock data, examples, or illustrative implementations instead of strictly following the instructions.

For document generation, it can be useful, but for code-related work, this behavior creates additional problems because I then have to spend extra time correcting or removing things that were never requested.

Recently, I have noticed more and more complaints about the declining quality of the model in Codex. I did not have these issues before: I used the 5.3 Medium model and was able to solve simple tasks without problems.

Now, even with the maximum 5.5 model, the quality has noticeably declined. The model makes mistakes, duplicates code, and even when I work with small, targeted slices, perform an audit before editing the code, and clearly specify what should and should not be changed, it still finds a way to make an error.

I also noticed that with the 5.5 model in high mode, writing code can take more than 13 minutes, while Gemini Flash Low handles a similar task in just a couple of minutes and even fixes mistakes made by Codex.

I work in both Codex and Antigravity, so I can clearly see the difference in code quality. Most likely, this is a company decision aimed at reducing token consumption by slowing down the model, but the quality has noticeably decreased as a result. I have already had to fix Codex’s mistakes in Antigravity, including duplicated code, structural issues, and distorted design. This is very noticeable, especially when there is something to compare it with.

Basically, the task itself is the problem, not the model. A 30-page file is rather long, and to produce it while preserving high quality, the model needs to keep in memory not only all the context but also the file it is producing. It also has to focus on multiple aspects of the file at once, which pulls the model in too many directions, and quality drops as it loses focus.

For workflows where you need to produce long documents, in my personal experience anything above four pages means asking whether it would be better to produce the file element by element and then do the final composition at the end. This way the model keeps its focus on a single part of the document and produces it with high fidelity, and then a final pass aligns the styling so that the sections do not drift in presentation style.
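The element-by-element workflow described above could be sketched roughly like this. It is only an illustration: `generate` is a hypothetical stand-in for whatever model call you use (it takes a prompt string and returns text), and the prompt wording is made up for the example.

```python
from typing import Callable, List


def build_document(
    outline: List[str],
    generate: Callable[[str], str],
    brief: str,
) -> str:
    """Generate one section per outline entry, then run a final style pass.

    `generate` is a hypothetical placeholder for a model call.
    """
    sections = []
    for number, title in enumerate(outline, start=1):
        # Each request sees the brief and the outline, but is asked for
        # exactly one section so the model can stay focused on it.
        prompt = (
            f"{brief}\n\n"
            f"Full outline: {'; '.join(outline)}\n"
            f"Write ONLY section {number}: {title}. "
            "Do not repeat content that belongs to other sections."
        )
        sections.append(f"{title}\n\n{generate(prompt).strip()}")

    draft = "\n\n".join(sections)
    # Final pass: align tone and formatting without changing the content.
    final_prompt = (
        "Unify the style and formatting of this document across sections. "
        "Do not add, remove, or reorder sections.\n\n" + draft
    )
    return generate(final_prompt)
```

Keeping the per-section drafts around also makes it easier to see which step introduced a repetition, and the final pass can be skipped or done in smaller batches if the full draft turns out to be too long for one request.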

I agree. I have used these models in every thin wrapper available, and now even the standard app is failing me and wasting my time. I am also noticing a lot of fluff: it creates ridiculous amounts of tests and verification, then writes documentation in random places, only for the end result to be a broken mess. If compute is the issue and everyone is treated fairly except OpenAI, they should really turn down the token waste, or AI credit waste, or whatever I should call it.

So glad we are now moving into the future, way above the stars, into gasp VOICE-ACTIVATED ACTIONS, but first, nerf the Mac app. The voice was working better than the Whisper that just came out at .13/minute. What are we even doing?