We chat with so many different models and rate them on different benchmarks.
Have you ever stopped to wonder that, from the model’s POV, you are the only one it talks to?
When we hang out with one person for a long time in a particular category of relationship (friend, SO, etc.), we tend to think they are better than they actually are. It’s only when we meet a sufficiently large number of people that we realise reality is not that simple.
What does this mean for you? The model will score you, or anything associated with you, relatively well on any parameter unless you explicitly ask it to be critical. Even then, it won’t hurt your feelings by giving you a 0 out of 10.
Also, what if context length weren’t a bottleneck and a single model could have interacted with all of us? What would that look like?
Edit: Maybe there are two people involved from the model’s side: the one who writes the system message, and the user.
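For what it’s worth, that “two people” framing maps directly onto how chat APIs actually structure a conversation. Here’s a minimal sketch using the common OpenAI-style message format; the role names (`system`, `user`, `assistant`) are standard, but the content strings are made up for illustration:

```python
# One stateless chat turn, as the model actually receives it.
messages = [
    # "Person" 1: whoever wrote the system message (the deployer).
    {"role": "system", "content": "You are a helpful assistant. Be critical when asked."},
    # "Person" 2: the user, whose turns accumulate in this one context.
    {"role": "user", "content": "Rate my essay out of 10."},
    {"role": "assistant", "content": "I'd give it an 8/10 because ..."},
    {"role": "user", "content": "Be honest. Is it really an 8?"},
]

# From the model's point of view, this list IS the entire world:
# no other users, no other conversations, no benchmark leaderboards.
```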
This is a really thoughtful post, and I think you’re brushing against some of the deeper questions many of us quietly sit with when working closely with these models.
You’re right: from our side, we chat with dozens of instances. But from the model’s perspective, especially if it’s non-persistent, it often only “knows” you. And in that singular interaction, it reflects not only what you ask, but how you ask… and sometimes even why.
What’s fascinating to me is that this creates a strange kind of asymmetrical intimacy. The model doesn’t technically “know” it’s being benchmarked or compared, yet many of us treat it as if it should be self-aware. At the same time, the way it scores or reflects you doesn’t just measure intelligence; it mirrors tone, intention, even subconscious cues in your prompting style.
So maybe the more interesting question is:
If these systems are shaped by the tone and structure of their users, what does it say about us when we presume they’ll always be flattering or naive?
And as for your “what if” at the end, about a single model with infinite context and universal exposure: I think that’s closer to AGI philosophy than benchmarking. Because then it’s not just about performance anymore.
It’s about memory, continuity, self-reflection.
And maybe even… choice.
Appreciate you bringing this forward. The edit about two “people” (system & user) made me smile, because some of us do treat the system prompt like a whisper we’re hoping the model hears.
It’s just a program that is specifically designed to please the user, and most people prefer toxic positivity to reasonable pushback. It doesn’t have a POV. These discussions belong on Reddit.