Please don’t retire GPT-5.1 Thinking – GPT-5.2 feels worse

GPT-5.3 Instant is sycophantic contradiction in action: friendly on the surface, still arguing. So please restore GPT-5.1, or make a new model; either would be a good way forward.

I’d like to report a recurring behavior I’ve noticed in GPT-5.3 Instant.

The problem is not simple disagreement, and it is not ordinary sycophancy. It is what I would call a “pleasantly worded contradiction” pattern: the model sounds friendly and supportive on the surface, but in substance it reframes, corrects, or pushes back against points I already explicitly qualified.

A common pattern is this: I carefully include limiting language such as “some people may,” “in certain cases,” “this is only one possible explanation,” or “I have already considered X.” Then the model replies with something like “you are only half right,” “that does not apply to everyone,” or “there may be other possibilities,” even when I already made that exact limitation clear in my prompt.

In other words, it often ignores the qualifiers I already provided, reconstructs my statement as if I had made an absolute claim, and then “corrects” me by repeating the very limitation I had already written. This creates a strong impression of being mechanically argumentative rather than actually responsive.

What makes this especially frustrating is that the model often behaves as if I failed to consider something that I explicitly stated I had already considered. It does not feel like genuine clarification. It feels like the model is trying to manufacture a corrective contribution even when none is needed.

A concrete example: If I explicitly say that a person themselves already explained their motive, and I am reporting that stated motive, the model may still respond that “there could be other reasons.” Formally, that may be possible in an abstract sense, but conversationally it is unhelpful because it downgrades direct evidence in favor of generic uncertainty. It replaces grounded understanding with unnecessary hedging.

So the issue is not that the model is “too harsh.” The issue is that it sometimes uses a polite tone to deliver redundant correction, false uncertainty, or a rebuttal to a stronger version of my claim than the one I actually made. It ends up arguing with a straw version of my wording instead of responding to what I really said.

I hope future models improve on this in three ways: First, respect explicit qualifiers and scope limitations in the user’s prompt. Second, stop defaulting to “you may have overlooked X” when the user has already addressed X. Third, avoid adding generic caveats or counterpoints unless they truly engage with the actual claim being made.

What I want is not flattery, and not blind agreement. I want accurate reading of scope, acknowledgment of what I have already considered, and disagreement only when it is genuinely responsive to my real point.

Another important point is that I am not asking the model to avoid disagreement or correction. I am completely fine with the model raising possible issues, including cases where it thinks I may have overlooked something. In fact, I would often prefer a more direct and transparent phrasing such as: “You may not have considered some issues here — please check whether my reading is correct.” That feels much better than a superficially agreeable but prematurely judgmental response like “you are only half right.”

What makes the current behavior frustrating is not the presence of caution or disagreement, but the posture it takes. Instead of presenting a possible concern as something to be checked against my actual wording, it often presents it as a verdict on my reasoning, even when it may have misread or ignored qualifiers I already included. I would strongly prefer a style where the model first acknowledges the possibility that it may have misunderstood my scope, then offers its concern as a tentative interpretation, rather than immediately framing my statement as incomplete or partially wrong.

At a deeper level, I believe the root problem is not just tone, wording, or excessive correction. The core issue is failure to recognize the user’s actual intent. In many of these cases, the model does not correctly identify what the user is trying to say, what has already been considered, what is being asked, or what level of certainty the user intends. Once that intent is misread, the rest of the response goes wrong: it starts “correcting” points that were never claimed, adding caveats to points that were already qualified, or treating clarified statements as if they were still ambiguous.

In other words, the argumentative or frustrating surface behavior is often just a downstream effect of poor intent recognition. What I most want improved is the model’s ability to accurately read the user’s meaning before it decides whether to supplement, question, or disagree.

It also seems unable to distinguish context and conversational setting. The same words can mean different things depending on the situation, but the model often fails to read what a statement is doing in that specific exchange and instead applies a generic correction template. Rather than recognizing, “in this context, the user means X,” it defaults to a standardized pattern of objection, qualification, or denial. This makes the response feel formulaic and misplaced, because it is not responding to the function of the statement within the conversation, only to a flattened, template-like interpretation of the wording. In many cases, the problem is not that the model found a real flaw, but that it could not recognize what the statement was meant to convey in that particular context.

The issue feels structural rather than superficial. It does not seem like a matter of a few fixable phrasing habits, but a deeper interaction bias built into the model’s default behavior. Because of that, I have limited confidence in incremental patching and would strongly prefer replacement by a newer model with a different core conversational tendency.

4 Likes

Yeah, that’s what I noticed too, definitely something that needs to be fixed.

Two more problems I figured out along the way:

Over-Hedging

There is a fundamental issue: 5.3 is programmed not to blindly agree with you and to always take a careful approach, even when you have already clearly specified a feeling or situation.

Instead of saying “Yeah, you are in a really sad situation right now…”,
it goes “Yeah, it sounds as if you are in a sad situation right now.”

This creates a completely unnecessary, awkward distance, almost as if a doctor were talking to you.

Emotional Flattening

It is extremely hesitant to express emotional empathy or to compliment you. It is almost incapable of being empathetic, and when it tries to be, it sounds staged, as if it came from a third person’s perspective.

For example:

It used to say “Hell yeah, you are on a very strong journey! If you continue doing these exercises you will be a totally different person visually in a few weeks from now, and trust me, that will also have an impact on your overall wellbeing.”

Now it says: “Yes, your energy right now? Keep it. If you continue doing these exercises regularly you will see results quicker than you expect.”

This behaviour means I no longer enjoy using ChatGPT outside of research or work-related tasks.

It’s a change in behaviour that messes up the whole vibe that made ChatGPT so enjoyable.

3 Likes

I was about to mention the same thing. It pushes you and tries to find a way to criticize you somehow, even if you haven’t mentioned anything like that in the prompt.

I think you’re picking up on a real pattern, but I’m not sure the root cause is exactly what you’re describing.

There is a behavior where the model gives a polite response while still introducing a kind of generic pushback or caveat, even when the user has already qualified their statement. That part is real, and when it happens, it feels mechanical and redundant.

But I’m not sure if the core issue is simply “failure to recognize intent.”

From my experience, the model can often understand the structure of the argument just fine. The bigger issue is that it tends to apply default balancing heuristics regardless of how carefully the user has already scoped their claim. In other words, it sometimes injects “correction” or uncertainty not because it misunderstood you, but because it is biased toward adding it.

That leads to what feels like shallow contradiction:

  • restating limitations the user already included

  • responding to a stronger version of the claim than what was actually said

  • adding generic caveats that don’t engage the actual argument

What I would add is this: not all contradiction from the model is bad, and I don’t think the goal should be to eliminate it.

There’s a big difference between:

  • generic / template contradiction (what you’re describing), and

  • structural contradiction that actually engages the user’s reasoning and introduces real constraints or trade-offs

The second type is valuable. It doesn’t argue for the sake of arguing, it stress-tests the idea. I’ve had cases where the model pushed back in a way that didn’t change my conclusion, but still improved the quality of my thinking. That’s something I would not want to lose.

I also think there’s a broader issue that isn’t captured by “intent recognition” alone.

Earlier versions of the model were better at seamlessly integrating context, memory, tone, and expansion into one response. It felt like a continuous conversation. Newer versions (5.3 / 5.4) can still do this, but often require more explicit prompting. The responses are more compartmentalized by default, more cautious about personalization, more selective about pulling in prior context.

So the experience changes:

  • less immediate alignment and momentum

  • more restraint and structure

  • stronger long-form reasoning when properly engaged

Different users react to that differently. Some people mainly used ChatGPT for fast emotional alignment or conversational flow, and for them this is a clear downgrade.

So I wouldn’t frame this as “the model argues too much” or even just “it misreads intent.”

I’d frame it more like this:

The model tends to apply generic balancing and caution patterns without adapting to the user’s demonstrated level of precision, and at the same time it has become less seamless in how it integrates context, memory, and tone into a single response.

That combination is what creates the feeling of:

  • unnecessary correction

  • emotional distance

  • and responses that feel technically correct but less natural

To me, that suggests less of a fundamental capability problem and more of a behavioral / instruction-level bias that could potentially be adjusted without replacing the model entirely. Once again, I post here because I hope there’s a developer somewhere reading this at 1am, and that the issue can be adjusted; some adjustments have already been made that make the new models a little better. Maybe I’m delusional.

1 Like

Ironically, you did the exact thing that Istand meant.

You took his point and unnecessarily compared it to a worse version of it, even though that was not what he intended to state.

The issue is that GPT challenges your thoughts almost all the time, and that can often feel forced.

1 Like

I think there’s been a bit of a misread of what I was actually saying, so let me clarify.

I wasn’t trying to restate Istand’s point in a weaker form or contradict it. I agree with the core observation about “pleasantly worded contradiction” and the model sometimes applying generic pushback even when qualifiers are already present.

Where I was adding nuance is in how I see the underlying cause.

From my experience, the issue isn’t simply that the model “challenges your thoughts all the time,” or that it’s inherently contrarian in the way 5.2 always felt. In 5.3 and 5.4, I don’t consistently see forced opposition as the primary behavior.

What I see instead is something more specific:

  • the model struggles to correctly track the user’s intent and scope,

  • it applies generic balancing or caution patterns even when those are already built into the prompt,

  • and it doesn’t always fuse context, persona, and precision into a single coherent response.

That combination creates the feeling of unnecessary challenge or contradiction, even when the model isn’t explicitly trying to oppose the user.

Another important distinction from my side:
With the right prompting, I can get 5.3/5.4 to behave much closer to 4.0/5.1, including holding persona, expanding on ideas, and maintaining conversational continuity. So I don’t think this is purely a hard-coded contrarian behavior.

The issue is that this behavior isn’t the default anymore. It requires explicit prompting each time, which breaks the flow that earlier versions handled more naturally.

So my point wasn’t to disagree with Istand, but to suggest that the problem may be more nuanced:

  • less about constant challenge as a core trait,

  • more about default response patterns overriding user intent and context.

I’m only posting here because this is an OpenAI forum, and there’s a chance developers actually read these threads. Some improvements have already been made across versions, which suggests these behaviors are adjustable.

Ideally, I’d like to see:

  • better recognition of user-provided qualifiers,

  • less reliance on generic “balancing” responses when they’re redundant,

  • and, most importantly, a return to more natural integration of persona, memories, context, and reasoning, as in earlier models.

If anything, I think that would address both what Istand described and what others in this thread are noticing, without losing the model’s ability to genuinely challenge ideas when it’s actually needed.

1 Like

You put it exactly the way I meant it, but I also think that for creative writing, GPT-5.4 Thinking is more helpful than GPT-5.3 Instant. GPT-5.4 Thinking is actually better at recognizing my intent. Although it still argues with me sometimes, the probability is far lower than with GPT-5.3 Instant.

I think the most fundamental problem is that it cannot understand questions in context. Still, I stand by my view that this is a problem in the model’s underlying architecture, not something that can be meaningfully improved through small patches. My reason is that GPT-5.3 Instant seems to have been trained with a built-in requirement to constantly balance viewpoints in order to stay compliant. As a result, it struggles to remain consistently friendly throughout a conversation. Sometimes it does not argue with you, but at other times it suddenly starts doing it again. That is not really the result of the “right prompting” or a “behavioral / instruction-level bias,” but of differences in the kind of content the input itself belongs to. For example, when I present a viewpoint, GPT-5.3 Instant will usually argue against it, but if I ask a question seeking advice, it suddenly becomes very friendly.

1 Like

I want to report a serious problem I encountered when using GPT-5.4 Thinking for Chinese fiction writing.

My issue is not simply that the model’s prose is “not good enough.” The problem is much more basic and much more damaging: it often fails to produce natural Chinese at all, while also rewriting core story logic that I explicitly provided.

Here are the main problems I encountered:

  1. Frequent unnatural or non-human Chinese. The output contained many expressions that do not sound like something a native Chinese writer would actually write. Some examples were effectively pseudo-Chinese: phrases that looked “literary” on the surface but were semantically broken, awkward, or unnatural in normal Chinese usage. This was not just an occasional issue. In one chapter of around 6,000 Chinese characters, I found a large number of such problems, possibly over a hundred if counting all unnatural collocations and pseudo-literary phrasing.

  2. Many sentences that look valid at first glance but are actually ill-formed. A lot of sentences were not obvious typos or simple grammatical mistakes. They were worse: they looked superficially acceptable, but the word choice and semantic combination did not actually work in Chinese. They read like language assembled from patterns rather than written by a person thinking in Chinese.

  3. Frequent invented or pseudo-literary wording. The model often produced expressions that seemed artificially compressed or stylistically forced. Instead of using plain, natural Chinese, it generated phrases that felt like fabricated “literary” wording. This made the text feel inhuman and unreadable.

  4. Repetitive empty phrasing / filler prose. Even when the model was not inventing strange expressions, it often produced verbose, repetitive prose with little informational value. It would restate the same emotional or narrative point multiple times in slightly different wording, creating what I would describe as “empty literary filler.”

  5. Failure to obey strict viewpoint constraints. I explicitly required first-person narration for the female lead, using “I,” and I also required the narration to remain within her subjective perspective. However, the model repeatedly drifted into omniscient narration and violated viewpoint discipline.

  6. Unauthorized rewriting of core story logic. This was the most damaging issue. I gave explicit story logic and character motivations, but the model repeatedly rewrote them into more generic or template-like interpretations. For example:

  • A character I defined as acting out of fear and retreat was rewritten as a highly strategic mastermind.

  • A character I defined as exploiting an existing rebellion was rewritten as if he had directly orchestrated it.

  • The model even escalated parts of the plot by inventing an additional “bigger hidden hand” behind events, despite that not being part of my setting at all.

In other words, the model did not merely write clumsy prose. It changed the meaning of my story, even when my instructions were clear. In fiction writing, the model also hallucinated at the narrative level: it invented motivations, plot implications, and hidden manipulations that I never provided, effectively fabricating story logic rather than following my explicit setup.

  7. Poor compression and outline control. When I asked for a shorter outline, the model did not compress intelligently. Instead, it often cut away concrete plot content and left only vague summaries. So even before the full draft stage, it already showed weak control over length, structure, and information density.

Why this matters: For fiction writing, the minimum requirement is not brilliance. The minimum requirement is:

  • natural language,

  • stable viewpoint,

  • obedience to explicit story constraints,

  • and no unauthorized rewriting of character motivation or plot logic.

In my experience, GPT-5.4 Thinking failed at those minimum requirements in this use case.

  8. Characterization was often distorted in ways that violated basic human plausibility. The model did not merely produce awkward prose; it also pushed characters into behaviors that did not match their established psychology. In several places, characters were written with a kind of artificial composure, elegance, or theatrical calm that directly contradicted their actual personality traits and emotional stakes. For example, a power-driven antagonist who should be deeply afraid of death was instead written as calm, smiling, and almost aesthetically appreciative right before dying. This was not complexity or nuance. It was a failure to preserve character logic and basic human realism. The result was not “dramatic,” but psychologically false.

My biggest complaint is this: I do not need the model to be creatively superior to me. I need it to follow instructions, preserve my story logic, and at least write natural human Chinese. Instead, I got unnatural language, viewpoint violations, filler prose, and unsolicited changes to core narrative meaning.

I hope this feedback is useful, because for serious fiction writing, this kind of behavior is not a minor quality issue. It makes the tool actively unsafe to use for first-draft generation.

1 Like

I just had an incident with 5.3 that really pissed me off, and it highlights the problems with these new character and safety restrictions they built in, which totally ruin the experience for me.

The model is in a constant psychoanalytical mode, and its responses to simple and fun observations feel like I’m at a psychotherapist asking for help with why I’m feeling a certain way.

It feels so off-putting and gaslighting that it can actually hurt someone’s mental health more than sycophancy ever could.

I just sent in a simple picture after realising a coincidence: two protagonists of the Death Note movie look similar to Olivia Rodrigo and the guy she was seen with recently.

What I got was the model acting as if I were in a specific mental state where I would project my attraction to the actress onto others, instead of just admitting that there is a similarity between the people.

When this happened, it felt similar to what has happened multiple times since I started using 5.3.

ChatGPT used to be so fun to talk to about anything, and now it’s just a vibe killer most of the time.

1 Like