ok… 5.4 thinking appeared in cloud ui… it’s time for testing
nope… 5.4 looks worse than 5.1 in thinking… seems like I’m out
As a follow-up to the whole “5.1 vs 5.2 vs everything else” discussion, I ran a small blind test.
Setup
- I took the same lecture and asked three different models to write a summary.
- Each summary was saved as 001.txt, 002.txt, 003.txt: model 001 wrote 001.txt, model 002 wrote 002.txt, model 003 wrote 003.txt.
- Then I asked each model to evaluate all three summaries and pick the best one (a rough script for this step is sketched below).
- None of them knew which model wrote which file; they only saw the texts.
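For anyone who wants to reproduce the blind-ranking step, here is a minimal Python sketch. It assumes a hypothetical ask_model() helper (wire it to whichever API or UI you actually use); the judge names and prompt wording are placeholders, not the exact prompts from my test.

```python
# Blind ranking sketch: each judge sees only the three texts, never the model names.
from pathlib import Path

FILES = ["001.txt", "002.txt", "003.txt"]   # one summary per model; file names carry no model info
JUDGES = ["judge-a", "judge-b", "judge-c"]  # placeholder labels, not real model IDs

def ask_model(model_name: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model_name` and return the reply text."""
    raise NotImplementedError("plug in your own chat-API call here")

def build_prompt() -> str:
    parts = [
        "Below are three summaries of the same lecture.",
        "Rank them from best to worst and briefly explain why.",
        "",
    ]
    for name in FILES:
        parts.append(f"--- {name} ---")
        parts.append(Path(name).read_text(encoding="utf-8"))
        parts.append("")
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt()  # same anonymized prompt for every judge
    for judge in JUDGES:
        print(f"=== verdict from {judge} ===")
        print(ask_model(judge, prompt))
```

The only important detail is that every judge gets the identical prompt, with the summaries identified by file name only.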
Result
All three models independently came to essentially the same ranking:
- Best: 003.txt
  Consistently described as the most complete, structured and balanced:
  - covers the full set of topics from the lecture,
  - keeps the question → answer structure,
  - preserves almost all key examples and images,
  - reads like a solid working recap, not a messy transcript.
- Second: 002.txt
  Described as the most pleasant "literary" version:
  - smooth language, nice flow, easy to read,
  - but noticeably less complete: some topics and examples are missing.
  Good as a "nice article", weaker as a full lecture summary.
- Third: 001.txt
  Seen as fairly complete but rough:
  - more "raw transcript" energy,
  - heavier, more repetitive, worse transitions,
  - useful as a draft, but not the best final version.
Then I asked ChatGPT 5.2 (separately, “out of competition”) to evaluate the same three files.
It gave the exact same ranking — 003 > 002 > 001 — and almost the same reasoning:
- 003 = best balance of coverage + structure + clarity,
- 002 = best style but cut down,
- 001 = dense, but rough and less readable.
Takeaway
- When you normalize the task (same lecture, same prompt) and look at the texts blindly, the models converge on the same notion of "quality" and even agree on which summary is best.
- The gap between models as writers is often smaller than the gap between a "full + well-structured edit" and a "partial / rough / under-edited text".
Moment of truth
Only after all that did I reveal who was who:
- 001 = one of the most famous free LLMs
- 002 = GPT-5.4
- 003 = GPT-5.1
So:
- Three different models, plus GPT-5.2 itself, all blindly picked 5.1's summary as the best overall.
- GPT-5.4 ended up in second place: nicer wording, but less complete.
- The free LLM landed in third, still decent, but clearly rougher.
Takeaway:
When you strip away branding and version numbers and just look at real tasks blind, even the models themselves keep voting for 5.1 as the most balanced, “actually useful” option.
Shame.
PS: Another important thing:
GPT-5.4 is not leagues above the free LLM; they sit in the same quality tier.
- In the blind evaluations, no model dismissed the free LLM's output as trash.
- The free LLM and 5.4 traded second and third place, depending on the judge:
  - sometimes 5.4 was ahead ("nicer style, but less complete"),
  - sometimes the free LLM was ahead ("fuller, just rougher").
- In other words: 5.4 performs roughly on par with a free competitor.
So if you strip away branding and version numbers and just look at what they actually write:
- 5.1 looks like a genuinely strong, well-balanced assistant.
- 5.4 does not look like some "next-level intelligence"; it just looks like a slightly more polished, slightly more trimmed alternative.
- The free LLM clearly plays in the same league as 5.4 on this kind of task.