ok… 5.4 thinking appeared in cloud ui… it’s time for testing
nope… 5.4 looks worse than 5.1 in thinking… seems like I’m out
As a follow-up to the whole “5.1 vs 5.2 vs everything else” discussion, I ran a small blind test.
Setup
- I took the same lecture and asked three different models to write a summary.
- Each summary was saved as 001.txt, 002.txt, 003.txt: model 001 wrote 001.txt, model 002 wrote 002.txt, model 003 wrote 003.txt.
- Then I asked each model to evaluate all three summaries and pick the best one (a rough script for this step is sketched below).
- None of them knew which model wrote which file; they only saw the texts.
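For anyone who wants to reproduce the blind-ranking step, here is a minimal Python sketch. It assumes a hypothetical ask_model() helper (wire it to whichever API or UI you actually use); the judge names and prompt wording are placeholders, not the exact prompts from my test.

```python
# Blind ranking sketch: each judge sees only the three texts, never the model names.
from pathlib import Path

FILES = ["001.txt", "002.txt", "003.txt"]   # one summary per model; file names carry no model info
JUDGES = ["judge-a", "judge-b", "judge-c"]  # placeholder labels, not real model IDs

def ask_model(model_name: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model_name` and return the reply text."""
    raise NotImplementedError("plug in your own chat-API call here")

def build_prompt() -> str:
    parts = [
        "Below are three summaries of the same lecture.",
        "Rank them from best to worst and briefly explain why.",
        "",
    ]
    for name in FILES:
        parts.append(f"--- {name} ---")
        parts.append(Path(name).read_text(encoding="utf-8"))
        parts.append("")
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt()  # same anonymized prompt for every judge
    for judge in JUDGES:
        print(f"=== verdict from {judge} ===")
        print(ask_model(judge, prompt))
```

The only important detail is that every judge gets the identical prompt, with the summaries identified by file name only.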
Result
All three models independently came to essentially the same ranking:
- Best: 003.txt
  Consistently described as the most complete, structured and balanced:
  - covers the full set of topics from the lecture,
  - keeps the question → answer structure,
  - preserves almost all key examples and images,
  - reads like a solid working recap, not a messy transcript.
- Second: 002.txt
  Described as the most pleasant "literary" version:
  - smooth language, nice flow, easy to read,
  - but noticeably less complete: some topics and examples are missing.
  Good as a "nice article", weaker as a full lecture summary.
- Third: 001.txt
  Seen as fairly complete but rough:
  - more "raw transcript" energy,
  - heavier, more repetitive, worse transitions,
  - useful as a draft, but not the best final version.
Then I asked ChatGPT 5.2 (separately, “out of competition”) to evaluate the same three files.
It gave the exact same ranking — 003 > 002 > 001 — and almost the same reasoning:
- 003 = best balance of coverage + structure + clarity,
- 002 = best style but cut down,
- 001 = dense, but rough and less readable.
Takeaway
- When you normalize the task (same lecture, same prompt) and look at the texts blindly, the models converge on the same notion of "quality" and even agree on which summary is best.
- The gap between models as writers is often smaller than the gap between a "full + well-structured edit" and a "partial / rough / under-edited text".
Moment of truth
Only after all that did I reveal who was who:
- 001 = one of the most famous free LLMs
- 002 = GPT-5.4
- 003 = GPT-5.1
So:
- Three different models, plus GPT-5.2 itself, all blindly picked 5.1's summary as the best overall.
- GPT-5.4 ended up in second place: nicer wording, but less complete.
- The free LLM landed in third, still decent, but clearly rougher.
Takeaway:
When you strip away branding and version numbers and just look at real tasks blind, even the models themselves keep voting for 5.1 as the most balanced, “actually useful” option.
Shame.
PS: Another important thing:
GPT-5.4 is not leagues above the free LLM; they sit in the same quality tier.
- In the blind evaluations, no model dismissed the free LLM's output as trash.
- The free LLM and 5.4 traded second and third place, depending on the judge:
  - sometimes 5.4 was ahead ("nicer style, but less complete"),
  - sometimes the free LLM was ahead ("fuller, just rougher").
- In other words: 5.4 performs roughly on par with a free competitor.
So if you strip away branding and version numbers and just look at what they actually write:
- 5.1 looks like a genuinely strong, well-balanced assistant.
- 5.4 does not look like some "next-level intelligence"; it just looks like a slightly more polished, slightly more trimmed alternative.
- The free LLM clearly plays in the same league as 5.4 on this kind of task.