Hi everyone,
I’m trying to determine whether other users are seeing a similar behavior change with GPT-5.4 Pro Standard on long-context, high-effort tasks.
I’m not claiming a confirmed backend bug. I’m looking for comparison data because the change I observed is large enough that it does not look normal.
What I tested
I have a repeatable long-context task that requires the model to:
- read a large uploaded context/file packet,
- reconcile multiple source documents,
- identify pending work,
- produce a concrete written deliverable,
- include an actionable implementation/review plan.
This is not a short Q&A prompt. It is the kind of task where the model needs sustained reasoning and careful file/context handling.
What I observed (using the same task to gather empirical diagnostic data)
A prior run of the same class of task, using GPT-5.4 Pro Standard, took roughly 60 minutes and completed the work correctly.
A later run, also using GPT-5.4 Pro Standard, completed in roughly 8 minutes, but the output was materially lower quality. It looked more like a readiness/summary response than the actual requested deliverable. Same task and files; the behavior simply changed from one day to the next.
The issue was not simply that the model was faster. The issue was:
- GPT-5.4 Pro Standard run A: ~60 minutes, complete and correct
- GPT-5.4 Pro Standard run B: ~8 minutes, incomplete and missing the core deliverable
Why this seems concerning
For this task type, a correct answer required the model to stay engaged across a large context and produce a concrete output. Instead, the shorter run appeared to stop at a high-level framing/acknowledgement stage.
The shorter run did not just compress the work. It skipped the central artifact the task required.
This resembles a lower effective reasoning-effort budget, but I cannot see the hidden backend setting, so I do not know whether the cause is:
- a temporary routing/configuration issue,
- a hidden reasoning-effort change,
- file/context handling degradation,
- early stopping behavior,
- or normal model variance.
Why I do not think this is just normal variation
A swing from about 60 minutes to about 8 minutes for the same class of long-context task is large by itself.
But the stronger signal is output completeness:
- Earlier run: long duration, complete deliverable
- Later run: short duration, plausible-looking summary, missing deliverable
The later answer looked superficially responsive, but it did not complete the actual work requested.
I have noticed a similar pattern before when a new model was released while I kept using the older, no-longer-current model. Since this happened on a Saturday, April 18th, a new model release might be the cause, but I have no way to confirm that.
Secondary tool/context anomalies
I also noticed some possible tool/context weirdness during diagnostics, though these may be separate issues:
- uploaded file retrieval seemed inconsistent;
- search over uploaded/context files appeared to surface unrelated prior material;
- a simple Python/stdout test behaved inconsistently in one diagnostic path, while a direct Python path worked.
Again, those may be unrelated, but I’m mentioning them in case others are seeing similar clusters.
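For context, the Python/stdout diagnostic mentioned above was essentially a minimal sanity check like the sketch below (my own illustrative version, not the exact code I ran): it prints a known marker and forces a flush, so that if the marker never appears in the tool's captured output, the problem is in the output-capture path rather than in Python itself.

```python
import sys

def stdout_sanity_check() -> bool:
    """Print a fixed marker and flush stdout.

    If the marker does not show up in the captured tool output,
    the stdout-capture path is suspect, not the Python runtime.
    """
    marker = "STDOUT_CHECK_OK"
    print(marker)
    sys.stdout.flush()
    return True

assert stdout_sanity_check()
```

In my case a direct Python invocation of a check like this behaved normally, while one diagnostic path did not, which is why I flagged it as a possible separate issue.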
Questions for other users
Has anyone else recently seen GPT-5.4 Pro Standard:
- finish long reasoning tasks much faster than before;
- produce a plausible-looking summary instead of the requested artifact;
- appear to use a lower effective thinking budget;
- skip file/artifact production in tasks where prior runs completed it;
- behave differently across otherwise similar Standard-mode sessions?
Useful comparison data would be:
- same or similar prompt
- same uploaded/context size
- model setting used
- earlier run duration and quality
- later run duration and quality
- whether the final deliverable was actually produced
- whether the run seemed to stop at summary/readiness instead of execution
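If it helps anyone structure their comparison data, here is a minimal sketch of the run log I have been keeping. All field and function names are my own invention, purely illustrative; the point is just to record the same fields for each run so speed and completeness regressions are easy to spot side by side.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RunRecord:
    # Illustrative fields only; adapt to your own setup.
    model: str                  # e.g. "GPT-5.4 Pro Standard"
    prompt_id: str              # same or similar prompt across runs
    context_size_chars: int     # size of uploaded/context material
    duration_minutes: float     # wall-clock duration of the run
    deliverable_produced: bool  # was the requested artifact actually created?
    stopped_at_summary: bool    # did it stop at summary/readiness instead of execution?

def compare(earlier: RunRecord, later: RunRecord) -> dict:
    """Summarize the delta between two runs of the same task."""
    return {
        "speedup_x": round(earlier.duration_minutes / later.duration_minutes, 1),
        "deliverable_regressed": earlier.deliverable_produced
                                 and not later.deliverable_produced,
        "records": [asdict(earlier), asdict(later)],
    }

# The two runs described in this post, as an example.
run_a = RunRecord("GPT-5.4 Pro Standard", "long-context-task", 400_000, 60.0, True, False)
run_b = RunRecord("GPT-5.4 Pro Standard", "long-context-task", 400_000, 8.0, False, True)
print(json.dumps(compare(run_a, run_b), indent=2))
```

Even two or three records in this shape from other users would be more useful than impressions, since it separates "faster" from "faster but missing the deliverable".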
I’m trying to determine whether this is expected variance, a temporary configuration/routing issue, a file/context handling issue, or a broader regression in effective long-context reasoning within GPT-5.4 Pro Standard.
Overall everything seems faster, and other users have likely noticed that ChatGPT is a lot quicker, roughly 4x or more. Previously there was a noticeable delay between sending a prompt and entering Thinking, and the steps in the Thinking tab took longer; now it all moves much more quickly, almost like a real-time chat. I just want to know whether this is the new normal, so I can work out how to engineer around it or find alternatives.