Perceived Drop in GPT-5 Quality Over the Last Few Weeks

I have been using GPT-5.5 as an extension in a VS Code clone (Codex extension in Cursor) logged in with my ChatGPT account. I always go with “Extra High” and FAST speed settings. I haven’t seen any performance issues in the last week.


I’ll try to work around this by automating the process with shorter instructions, splitting the task into multiple steps, and validating the output at each stage. After that, I’ll leave more detailed feedback here.

Still, a 30-page task should not be a serious problem for large models. Around 15,000 output tokens should not make a model like GPT-5.5 start drifting, repeating content, or losing coherence. That would be understandable for a smaller local LLM, but not for a model at this level.

These are simple text-only documents that just need to be well written and properly structured.

There are no complex tables, charts, images, code, or technical formatting involved. The main requirement is simply coherent long-form writing with good structure, no repetition, and no duplicated or out-of-context content.
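For the stage-by-stage validation mentioned above, one cheap automated check is a scan for near-duplicate paragraphs in each generated chunk. The following is only a minimal sketch using Python's standard library; the function names and the 0.9 similarity threshold are assumptions chosen for illustration, not anything Codex or ChatGPT provides.

```python
import re
from difflib import SequenceMatcher


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", paragraph.strip().lower())


def find_repeated_paragraphs(
    text: str, threshold: float = 0.9
) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for every paragraph pair at or above threshold.

    Paragraphs are assumed to be separated by blank lines. The threshold is a
    hypothetical default and would need tuning on real documents.
    """
    paragraphs = [normalize(p) for p in text.split("\n\n") if p.strip()]
    repeats = []
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            matcher = SequenceMatcher(
                None, paragraphs[i], paragraphs[j], autojunk=False
            )
            ratio = matcher.ratio()
            if ratio >= threshold:
                repeats.append((i, j, ratio))
    return repeats
```

Running this on each step's output before moving to the next step would flag repeated sections early. The pairwise comparison is quadratic in the number of paragraphs, which should be acceptable for a document of roughly 30 pages.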

Let’s just be straight: there has been a huge drop in quality over the last week or two.

So many updates have been rolling out, and I’ve been hoping one of them would fix it, but in all honesty, it has got worse.

I’ve used Codex daily for months now, and today is the first time I gave up and came to the forums simply to see if others were having the same issues.

Simple things that used to be fixed instantly are now taking 5 prompts to get right. The OP summed it up perfectly: it tries to do everything other than what you ask it to do in order to solve the problem.

I am a fully signed-up member of the Codex fan club, but today it’s been awful.

I rarely go to forums to complain, but this time I felt the need to do so because I use the model frequently and I can clearly tell that the performance is not the same.

I don’t know how to prove this objectively, but people who use it frequently will understand what I mean. If they have not noticed it yet, they probably will.

Two perspectives I can add here:

First, after a very good few months with Codex, I recently had a bad stretch where many of my Codex outputs were questionable, to say the least.

In my case, the reason was me. I had made small, untested changes across my harnesses and degraded performance without noticing it right away. I had to take a step back for a few days before picking up the pieces and learning from my mistakes.

The second perspective is based on what I have seen here in the forum. There are almost no users agreeing with that sentiment. We have seen literal storms of complaints about model degradation before, but that is not happening here.

The good news may be that this is one of those learning opportunities based on tough experience. It is the kind of thing people usually do not like to talk about. After all, everyone is seemingly building faster and better all the time. Who has time to publicly reflect when things go the other way?

This is just me sharing my experience. I hope this helps!

I concur regarding Gemini and its latest iterations when it comes to textual analysis. I set the same task across five major LLM providers: to summarise the play ‘American Buffalo’ by David Mamet in five sentences, whilst retaining the dramatic drive and thematic core of the original text. Gemini was head and shoulders above all others. Literally on a different planet.

I had a terrible experience with 5.4 in the past 24 hours. It simply could not write a patch correctly, and it could not reproduce the file context accurately enough for a diff to match. It just made attempt after attempt against different parts of the code, reloading the full code base into context via a tool, only to fail over and over.

I have logs of every code state and every diff attempted against them, but there is no point in investigating further when the output is all hallucinations.


[result] failed: Prepare failed: Invalid Context: could not find expected block in file.
Patch context must match the current file exactly; unchanged lines need a leading space prefix.
Expected block (first lines):
[usage] in/cached: 131077/122624 | out/reasoning: 113/90

**Checking patch failure**

Hmm, it looks like the patch failed due to a context mismatch, maybe from previous patches. I should inspect the current on-disk files around the relevant sections. Since there’s no search tool available, I could use reload_context_files. This is allowed to refresh snapshots after a series of apply_patch edits. We have only one failed patch, so it seems appropriate to try refreshing the entire file. I’ll go ahead and do that!

[function] reload_context_files: completed: REFRESHED: All context files are updated to the latest on-disk versions.
[usage] in/cached: 134160/0 | out/reasoning: 770/481


**Patching code carefully**

I’m thinking the issue could be related to differences in blank line counts. To ensure everything goes smoothly, I want to use @@ def parse_size_string(size_str: str) to delete the function. It might be wise to work with smaller code patches for accuracy. I’d also like to inspect memory first, but since there’s no search tool available, I’ll focus on patching functions with context diff markers. Let’s make sure to reference the exact original code for safety.

**Planning patching strategy**

I think there might be two blank lines after the return and before the …

[result] failed: Prepare failed: Invalid Context: could not find expected block in file.
Patch context must match the current file exactly; unchanged lines need a leading space prefix.
Expected block (first lines):


**Exploring potential changes**

I don’t have exact information here. I could use delete or replace methods around known unique docstring lines. It’s possible that the global function’s docstring changed due to our patch, but it might also be the same. I wonder if using a patch hunk specifically around …

[result] failed: Prepare failed: Invalid Context: could not find expected block in file.
Patch context must match the current file exactly; unchanged lines need a leading space prefix.
Expected block (first lines):

**Analyzing context and format issues**

I’m puzzled by why the file isn’t working. Maybe it uses tabs or requires exactly four spaces instead of nine? I’m wondering if the special anchor isn’t matching properly. It usually works, but perhaps the reload_context_files isn’t loading the full file correctly.

I think the tool might not understand special anchors semantically. From the developer docs, it seems like the correct way to indicate functions is with the @@ operator, but I might need to adjust how the context lines are being used. Let’s try using plain raw context lines to see if that resolves the issue.

[result] failed: Prepare failed: Invalid Context: could not find expected block in file.
Patch context must match the current file exactly; unchanged lines need a leading space prefix.

It goes on and on like this, often with little reasoning spent even at xhigh. The model also seems to ignore the failure state of a patch and moves elsewhere, writing different diff context, without a reasoning summary, that fails as well. At least reloading files doesn’t kill the 90% cache discount, because nothing in the files changed, yet it still blows through a day’s complimentary 1M tokens and continues until killed.

Restarting the same turn with the restored code on 5.5 high went just fine.

Perspective here: “let’s break old models that don’t bill twice as much”

I’m using ChatGPT 5.5 with heavy thinking to help me build an FPS multiplayer game… 50,000 lines of code so far, 20 living documents… GPT-5.5 does a stellar job sticking to and updating the docs and navigating 300+ files… I’m not even using Codex, just being very structured in ChatGPT.

I’ll add: even 5.5 has a context recitation problem. It sometimes continues past a patch surface into made-up code, or just writes the context wrong.

Its own analysis of its patch conversation, long but summarized here:

Bottom line

The patch failures were not caused by the tool namespace or an apparent parser issue. They were mainly caused by me:

  1. writing long hunks with guessed interior lines,
  2. assuming context that was not present,
  3. reusing stale patch fragments after successful edits,
  4. treating the hunk locator as more semantic/fuzzy than it is.

The later successful patches worked because they used smaller, literal, current-state context.
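To illustrate why guessed interior lines fail while small literal context succeeds, here is a minimal sketch of a strictly literal block locator. It is not the actual patch tool's implementation. It only assumes the behaviour stated in the error message ("Patch context must match the current file exactly"), and `find_block` as well as the body of `parse_size_string` are hypothetical.

```python
def find_block(file_lines: list[str], context: list[str]) -> int:
    """Return the 0-based line index where `context` occurs verbatim, or -1.

    Matching is purely literal: every line, including blank lines and
    indentation, has to be identical. Nothing here is fuzzy or semantic.
    """
    if not context:
        return -1
    last_start = len(file_lines) - len(context)
    for start in range(last_start + 1):
        if file_lines[start:start + len(context)] == context:
            return start
    return -1


# Hypothetical file content, invented for this example.
source = (
    "def parse_size_string(size_str: str):\n"
    "    return int(size_str)\n"
    "\n"
    "\n"
    "def main():\n"
    "    pass\n"
).splitlines()

# A long hunk that guesses one blank line where the file has two.
guessed = ["    return int(size_str)", "", "def main():"]
# The same hunk copied literally from the current file state.
literal = ["    return int(size_str)", "", "", "def main():"]
# A smaller hunk that avoids the uncertain region entirely.
small = ["def main():"]

guessed_at = find_block(source, guessed)  # not found
literal_at = find_block(source, literal)  # found
small_at = find_block(source, small)      # found
```

Under this assumption, a single miscounted blank line is enough to make the whole hunk unlocatable, which matches the model's own conclusion that shorter, literal, current-state context is the safer strategy.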

It seems these models need not an internal temperature of 0.98 but rather a top-k of 1, as there is no room for second-place tokens when coding.
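The difference between the two settings can be shown with a toy sampler. This is only a sketch under stated assumptions: the logits are invented, `sample_token` is a hypothetical helper, the 0.98 figure is the poster's claim rather than a documented value, and nothing here reflects how any hosted model actually samples or which parameters its API exposes.

```python
import math
import random


def sample_token(
    logits: dict[str, float],
    temperature: float,
    top_k: int,
    rng: random.Random,
) -> str:
    """Sample one token after top-k filtering and temperature scaling.

    Assumes temperature > 0 and top_k >= 1.
    """
    # Keep only the top_k highest-scoring candidates.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:top_k]
    # Softmax weights with temperature, shifted by the max for stability.
    max_logit = top[0][1]
    weights = [math.exp((logit - max_logit) / temperature) for _, logit in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented logits for the next token of a patch context line.
logits = {"    return": 2.0, "  return": 1.5, "\treturn": 0.5}
rng = random.Random(0)

# top_k=1 leaves a single candidate, so sampling is greedy at any temperature.
greedy = {sample_token(logits, 0.98, 1, rng) for _ in range(1000)}
# top_k=3 at temperature 0.98 lets lower-ranked tokens through some of the time.
sampled = {sample_token(logits, 0.98, 3, rng) for _ in range(1000)}
```

With `top_k=1` the temperature no longer matters, because only the highest-scoring token survives the filter. With a wider `top_k`, a slightly wrong indentation token can be drawn, and under a literal patch matcher a single such token is enough to break the context.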