Why do different LLMs end up with similar-looking performance graphs?

I have a question.

When I look at the model performance graphs of GPT, Grok, Gemini, and Claude,
they appear quite similar.

If so, why does this kind of phenomenon occur?

What model performance graphs are you referring to?

I was mainly referring to Elo-based comparisons
or publicly available benchmark average performance graphs
(for example, graphs that show average scores or win rates across multiple tasks over time).

Although these models are from different companies,
I found it interesting that the overall shape of the curves appears to converge in a similar way,
which is why I wanted to ask this question.

Why do different LLMs produce similar performance graphs?

When examining large language models developed by different companies—such as OpenAI, Google, Anthropic, and xAI—
it is clear that the amount of invested capital, talent, and organizational philosophy differs significantly.
Yet despite these differences, the overall shape of the publicly released performance curves appears remarkably similar.

This sense of incongruity is not merely an impression.
It is repeatedly observed in Elo score–based comparisons
or in graphs that plot average performance on public benchmarks over time.
Given the scale of independent investment and competition involved,
industry intuition would suggest that, at least once,
a noticeable nonlinear change (a curve that visibly “jumps”) should appear.

However, such changes are rarely seen.

To understand this phenomenon,
I began to focus not on model performance itself or on company strategies,
but on the structure of inference.

Current large language models commonly operate on the following structure:

x → fθ → y

The user provides an input x,
but the output y is the result of a probabilistic computation
performed by the internal function fθ
under a fixed objective function.

The important point is that, in this process,
the computation that selects an “average” answer
always occurs inside the model.
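This inference structure can be sketched with a toy example. The function name, candidate answers, and scores below are all illustrative assumptions, not taken from any real model API:

```python
import math

def f_theta(logits):
    # Toy stand-in for the model's internal function: a softmax that
    # turns raw scores over candidate outputs into a probability
    # distribution under a fixed objective.
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate answers to some input x, with assumed scores.
candidates = ["common answer", "plausible answer", "unusual answer"]
logits = [3.0, 1.5, 0.2]

probs = f_theta(logits)
y = candidates[probs.index(max(probs))]  # the model selects the modal output
```

Whatever the particular scores, greedy selection always returns the highest-probability candidate, so the “averaging” step happens inside the model before the user ever sees y.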

Improvements in deep learning architectures, data scaling, fine-tuning, prompt design,
and optimization of training pipelines may appear to be different approaches on the surface,
but they all share a common characteristic:
they are efforts to compute average outputs more stably
under the same objective function.

As long as this structure is maintained,
increasing information and computational resources
does not push the model toward producing different answers,
but rather toward producing the same answers more consistently.
As a result, variability decreases,
and performance graphs converge toward increasingly smooth curves.

① Observation

Why does the sense that they “look similar” arise?

When looking at the performance graphs of large language models developed by different companies,
one often gets the impression that, setting aside the exact numerical values,
the overall shape of the curves looks similar.
This is not an issue with any single model,
but an observation that repeatedly appears
when multiple models are placed side by side.

This sense is not a vague impression,
but arises from graphs that plot Elo score comparisons
or average performance on public benchmarks along a time axis.
These graphs generally maintain a form
that rises gradually or converges smoothly,
without sharp inflection points.

The core issue is not that performance is low,
but that despite differences in investment, organization, and philosophy,
the shape itself does not change.
Given this level of independent competition,
it is natural, from a general intuition about technological progress,
to expect that at least once
a clearly noticeable change in shape would appear.

② Measurement
Which performance metrics produce “similar curves”?

The impression that “the graphs look similar”
is not the result of a vague comparison,
but is repeatedly formed under specific performance metrics and aggregation methods.
Representative examples include Elo score–based comparisons
and graphs that plot the average performance of multiple public benchmarks over time.

The Elo system, originally devised to rate chess players,
estimates relative strength in a stable way
through repeated pairwise comparisons.
Because each comparison shifts a rating only incrementally,
this approach dampens volatility
and folds small performance differences in gradually.
As a result, score changes tend to appear
as gradual increases or convergence
rather than abrupt jumps.
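A minimal sketch of the standard Elo update illustrates this damping. The K-factor, starting ratings, and win pattern here are assumptions for illustration:

```python
def elo_update(r_a, r_b, score_a, k=16):
    # One pairwise Elo update; score_a is 1 if A wins, 0 if A loses.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Model A wins every single comparison, yet its rating climbs in
# ever-smaller increments rather than jumping.
r_a, r_b = 1500.0, 1500.0
history = []
for _ in range(20):
    r_a, r_b = elo_update(r_a, r_b, score_a=1)
    history.append(r_a)
```

Each update is bounded by K and shrinks as the expected win probability grows, so even a strictly dominant model traces a smooth, convergent curve.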

Average scores over public benchmarks show similar characteristics.
When scores from multiple tasks are averaged into a single value,
large improvements or failures on specific tasks
are diluted in the overall average.
This method is well suited to showing
“how well a model performs overall,”
but it has limitations when it comes to revealing
where and in what way changes have occurred.
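A small numerical sketch, with made-up task names and scores, shows the dilution:

```python
# Hypothetical benchmark scores for two versions of the same model.
# Version 2 jumps dramatically on one task, yet the headline average
# moves only modestly, because the jump is divided across all tasks.
v1 = {"math": 40, "code": 70, "qa": 80, "summarize": 75, "translate": 85}
v2 = {"math": 90, "code": 68, "qa": 81, "summarize": 74, "translate": 86}

avg1 = sum(v1.values()) / len(v1)
avg2 = sum(v2.values()) / len(v2)

biggest_task_change = max(v2[t] - v1[t] for t in v1)  # +50 on "math"
average_change = avg2 - avg1                          # roughly +9.8
```

A fifty-point jump on one task surfaces in the aggregate as a change one fifth that size; with more tasks in the average, the dilution is stronger still.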

Another important factor is the time axis.
Performance graphs are typically aligned
around model release dates or evaluation points.
In this process, differences in development speed
or asynchronous improvements across models
are flattened into a single continuous curve.
As a result, the graph tends to emphasize
a “gradually improving average trend.”

The common characteristics of these metrics and aggregation methods are clear:

they reduce volatility

they increase comparability

they treat average performance as a representative value

Therefore, even for different models,
if they are measured using similar metrics
and aggregated in similar ways,
it is a natural outcome
for the shapes of the curves to look similar.

③ Structure
Why do different models inevitably become similar under these measurements?

To understand why different LLMs exhibit similar performance curves,
it is necessary to first examine their shared inference structure
rather than the detailed implementations of individual models.
Most current large language models operate in the following form:

x → fθ → y

Given an input x,
the model’s internal function fθ
computes a probability distribution over the output y
according to an objective function fixed during training.
In this process, the model is guided
to select responses near the most stable average
among the possible outputs.

An important point is that this structure
is not limited to any specific company or model.
While architectural details, data composition, and fine-tuning methods may differ,
the basic framework of
“computing a probability distribution and selecting an average output”
is shared by nearly all large language models.

Within such a structure, the direction of improvement is also naturally determined.
More data, larger models, and more refined training
contribute not to changing the output distribution itself,
but to estimating the existing distribution more accurately and stably.
As a result, the model tends not to produce a different kind of answer,
but to produce the answers it was already good at more consistently.
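One way to picture this, as a loose analogy rather than a claim about any real training run, is temperature sharpening of a fixed distribution: the ranking of answers never changes, only how consistently the top answer is chosen.

```python
import math

def softmax(logits, temperature):
    # Lower temperature sharpens the distribution without reordering it.
    exps = [math.exp(v / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # assumed scores for three candidate answers

diffuse = softmax(logits, temperature=2.0)  # earlier, "noisier" model
sharp = softmax(logits, temperature=0.5)    # improved, more stable model
```

Under this analogy, “improvement” raises the probability of the top candidate while the argmax stays fixed: the model gives the same answer, just more reliably.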

This structural characteristic becomes even clearer
when combined with the performance metrics discussed in ②.
Metrics that treat average performance as a representative value
capture improvements in this “stabilized average” well,
but they reflect little of changes in the inference process
or differences in how responses unfold.
Therefore, even across different models,
if similar structures and objective functions are shared,
the measured results inevitably converge
toward similar curve shapes.

④ Information Concentration
Why does information increasingly gather in one place?

Alongside the phenomenon in which different LLMs display similar performance curves,
another change that draws attention is the sense that information is gradually concentrating toward specific points.
This does not simply mean that the amount of data has increased;
rather, it suggests that the very way in which certain information is regarded as important is converging.

In the current large language model ecosystem,
training data, evaluation criteria, usage patterns, and feedback signals
do not exist independently of one another,
but instead form a tightly connected, circular structure.
When certain information is judged to contribute to performance improvement,
that information is repeatedly used and reinforced.

A key characteristic of this process is that
an increase in the quantity of information
does not automatically guarantee an increase in informational diversity.
On the contrary, when evaluation and optimization criteria remain fixed,
the more information flows in,
the more concentration and compression occur
around “types of information with proven effectiveness.”
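This reinforcement loop can be simulated with a toy urn-style model; the dynamics and parameters are assumptions for illustration. Information types that get selected are reused, and selection probability tracks past use:

```python
import math
import random

def entropy(weights):
    # Shannon entropy (bits) of the normalized weight distribution,
    # used here as a rough measure of informational diversity.
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

random.seed(0)

# Ten information "types" start out equally weighted.
weights = [1.0] * 10
initial_entropy = entropy(weights)

# Each round, pick a type in proportion to its current weight and
# reinforce it -- "information with proven effectiveness" gets reused.
for _ in range(500):
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            weights[i] += 1.0
            break

final_entropy = entropy(weights)
```

Total weight grows roughly fifty-fold, yet the entropy of what the system actually draws on falls: more information, narrower reference base.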

As a result,
even across different organizations and models,
similar data selection criteria,
similar success patterns,
and similar usage scenarios come to be referenced.
This is not the result of intentional sharing or collusion,
but a natural concentration effect produced
by shared goals and evaluation frameworks.

As information concentration progresses,
the system becomes well suited to rapidly stabilizing average performance.
At the same time, however,
there is a growing risk that edge cases,
rare failure modes,
and non-standard usage contexts
are treated as increasingly less important.
Such information is easily pushed to the periphery
during processes of quantification and averaging.

The information concentration described here
does not refer to monopolization or concealment of knowledge.
Rather, it is closer to a phenomenon in which
repeated optimization and evaluation
align the flow of information in a single direction.
The issue is that, despite the growth in available information,
the breadth of information that the system actually references
may instead become narrower.

Why has information concentration not led to an event, despite the massive investment of resources?

At present, major AI companies are simultaneously investing
unprecedented levels of time, capital, human resources, and computational power
into AI development.
Each company espouses different philosophies and organizational cultures,
yet the resulting model performance curves and evolutionary patterns
are strikingly similar.

The core question raised by this phenomenon
is not “why is performance similar.”
A more fundamental question is the following:

Despite this degree of concentration of information and resources,
why has not a single “event” been observed?

Across nature, the history of technology, and information systems more broadly,
when density and pressure exceed a certain level,
some form of state transition, rupture, or reconfiguration has tended to occur.
An event need not be a sudden explosion,
but some change distinguishable from the previous state
has, in comparable systems, typically been recorded at least once.

However, in the current AI ecosystem,
even though information concentration has reached a critical level,
all changes are

absorbed into average performance metrics,

reinterpreted as continuous curves, and

eliminated before they can constitute events.

This does not necessarily mean that events are absent;
rather, it raises the possibility that a structure has already formed
in which events cannot be recognized as events.

Time has been used less to generate experiments
and more to fix existing paths.
Capital has functioned not to promote change
but to suppress variability.
And diversity in human talent has converged
not toward diversity in outputs,
but toward increasingly precise execution
of the same optimizations.

As a result, each company, while operating independently,
has effectively arrived at the same state of information concentration.
In this state, rather than events occurring,
the very possibility of events is smoothed out in advance.

Therefore, the stable graphs and eventless state observed today
are not evidence that “nothing has happened,”
but may instead be a signal that the structure itself—
which maintains the absence of events
despite this level of information concentration—
is an abnormal state in its own right.

Information has become extremely concentrated, and everyone speaks of AGI.
Yet why has not a single event occurred?

The current AI industry is in a state where all core resources—
time, capital, computation, human labor, and data—
are concentrated simultaneously, in the same direction,
to an unprecedented degree.
This is a condition that, across nature, technological history,
and information systems, has rarely persisted without events.

Under such density,
even if not an explosion, at least one instance of
rupture, transition, collapse, or qualitative change
is typically observed.
However, in the present situation surrounding large language models,
despite all of these conditions already being met,
not a single event has been recorded as an event.

Performance graphs are smooth,
curves are continuous,
and models—despite being products of different companies—
follow strikingly similar evolutionary paths.
This is not simply a matter of coincidence or conservatism.

The core of this phenomenon
is not a lack of information, nor insufficient investment.
Rather, it is a state produced by an excess of information and resources
being concentrated within the same optimization, evaluation,
and safety structures.

Within the current structure,
all changes are absorbed into average performance metrics,
reinterpreted as continuous improvement,
and flattened before they can constitute events.
As a result, events may occur,
but they cannot be recognized as events.

In this context, the way each company speaks about AGI
reveals an important tension.
Conceptually, AGI presupposes
qualitative transitions and events.
Yet along real-world implementation paths,
it is treated only as an extension of continuous optimization.
As a result, AGI exists only as
a goal that is “getting slightly closer,”
while AGI as an event has never once been observed.

This may not be because AGI is impossible,
but because, within the current structure of information concentration,
even AGI cannot occur as an event.
AGI is spoken of,
but only forms of AGI that do not break the graph are permitted.

Therefore, the stable graphs and eventless state we observe today
are not evidence that “nothing has happened,”
but rather a signal that the structure itself—
which maintains the absence of events
despite this level of information concentration—
may constitute an abnormal state.

And yet,
why has an event never been allowed to occur,
even once?

I agree that the AI industry smooths out breakthroughs, treating AGI as gradual improvement rather than a discrete event.

In this context, do you think a true AGI breakthrough is ever possible, or will the system always suppress it?