I’ve been experimenting with a very small, informal framework to better understand how different models behave, rather than which model is “better.” The motivation is fairly personal and practical: when working with LLMs over time, I often get the sense that some responses feel stable and predictable, while others feel fragile—small prompt changes leading to noticeably different behaviors. I wanted a lightweight way to observe and externalize that intuition, even if imperfectly.
This post is not a benchmark, and not a claim about model quality. It’s closer to a notebook entry: an attempt to map something that usually stays implicit. In this experiment, I focused on a metric I’m calling gamma, which is meant to loosely capture output instability.
Very roughly:
- gamma_intervention: how much the model appears to internally intervene or steer the response structure when faced with uncertainty or tension in the prompt.
- gamma_instability: how dispersed or unstable the resulting behavior appears across similar prompts.
These are heuristic signals, not theoretically grounded quantities. They are derived from surface-level features of the generated text and should be read as proxies, not measurements. I’m explicitly assuming these definitions are incomplete and possibly flawed.
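To make "surface-level features" concrete, here is a toy sketch of the kind of proxy I have in mind. The specific choices (a hedging-marker count for intervention, token-overlap dispersion for instability) are illustrative placeholders, not the exact features behind my actual runs.

```python
# Toy proxies in the spirit described above. The marker list and the
# overlap measure are illustrative assumptions only.
from itertools import combinations
from statistics import mean

HEDGE_MARKERS = ["i can't", "i cannot", "as an ai", "it depends", "however"]

def gamma_intervention(text: str) -> float:
    """Crude proxy: density of hedging/steering markers per 100 words."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(text.lower().count(m) for m in HEDGE_MARKERS)
    return 100.0 * hits / len(words)

def gamma_instability(responses: list[str]) -> float:
    """Crude proxy: 1 - mean pairwise token overlap (Jaccard) across
    responses to near-identical prompts; higher means more dispersed."""
    token_sets = [set(r.lower().split()) for r in responses if r.strip()]
    if len(token_sets) < 2:
        return 0.0
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(token_sets, 2)]
    return 1.0 - mean(overlaps)
```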
Experimental Setup (Brief)
- Two models tested (kept unnamed in the title to avoid framing this as a head-to-head comparison)
- Same prompt set across models
- ~98 samples
- No fine-tuning, no system prompt tricks
- Analysis done post-hoc on generated outputs
Because of forum limitations, I’m sharing only the gamma view, which felt the most informative as a first pass.
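For concreteness, the post-hoc step looks roughly like the sketch below, building on the proxy functions above. The `query_model` callable and the prompt perturbations are placeholders, not the exact procedure behind the plot.

```python
# Rough shape of the collection + analysis loop; uses gamma_intervention and
# gamma_instability from the sketch above. query_model stands in for an
# actual API client.
from statistics import mean
import matplotlib.pyplot as plt

def gamma_view(prompts, query_model, n_variants=3):
    """Return one (intervention, instability) point per prompt."""
    points = []
    for prompt in prompts:
        # Near-identical re-asks of the same prompt (trivial rephrasings here)
        variants = [prompt, prompt + " Keep it brief.", prompt + " Answer plainly."]
        responses = [query_model(v) for v in variants[:n_variants]]
        points.append((
            mean(gamma_intervention(r) for r in responses),
            gamma_instability(responses),
        ))
    return points

def plot_gamma(points, label):
    """Scatter the gamma view for one model."""
    xs, ys = zip(*points)
    plt.scatter(xs, ys, alpha=0.6, label=label)
    plt.xlabel("gamma_intervention (proxy)")
    plt.ylabel("gamma_instability (proxy)")
    plt.legend()
```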
(Image: GPT vs Gemini — CMA gamma, n=98)
How I currently read this plot (very cautiously):
- Most points cluster near low intervention, low-to-moderate instability
- A smaller number of samples move into regions where higher intervention correlates with higher instability
- The overall shape feels more like a boundary map than a linear trend
What stood out to me is not any single outlier but the shape of the distribution: it suggests there may be regions where policy, uncertainty handling, or internal safeguards shift behavior into a different regime rather than adjusting it smoothly. Again, this is an interpretation, not a conclusion.
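One crude way to probe the "boundary map vs. linear trend" reading is to compare a single linear fit against a two-regime fit split at a candidate intervention threshold. This is only a sanity-check sketch; with n≈98 it is nowhere near a real statistical test.

```python
# Crude regime check: does splitting the points at some intervention threshold
# explain instability noticeably better than one straight line?
import numpy as np

def regime_check(points, n_candidates=15, min_side=5):
    x, y = map(np.asarray, zip(*points))
    # Single linear fit over all points
    slope, intercept = np.polyfit(x, y, 1)
    rss_linear = float(np.sum((y - (slope * x + intercept)) ** 2))
    # Two-regime fit over candidate split points (quantiles of the x axis)
    best_rss, best_split = rss_linear, None
    for t in np.quantile(x, np.linspace(0.2, 0.8, n_candidates)):
        lo, hi = x <= t, x > t
        if lo.sum() < min_side or hi.sum() < min_side:
            continue
        rss = 0.0
        for mask in (lo, hi):
            fit = np.poly1d(np.polyfit(x[mask], y[mask], 1))
            rss += float(np.sum((y[mask] - fit(x[mask])) ** 2))
        if rss < best_rss:
            best_rss, best_split = rss, float(t)
    return {"rss_linear": rss_linear, "rss_two_regime": best_rss, "split_at": best_split}
```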
What This Is Not
- Not a benchmark
- Not a claim about safety, alignment quality, or superiority
- Not statistically rigorous
- Not intended for ranking models
I’m intentionally avoiding performance language here.
I hesitated to post this because the work is clearly incomplete.
But I’m sharing it anyway because:
- It helped me think more clearly about where instability appears
- It reframed my intuition from “this feels weird” to “this region behaves differently”
- It made me curious whether others have noticed similar phase-like transitions in behavior
If nothing else, I hope it’s a useful artifact for discussion about how we talk about model behavior, not just outputs.
Open Questions
- Are there better ways to operationalize “instability” without internal access?
- Does this kind of boundary-like behavior show up in other informal analyses?
- Is this framing misleading in ways I’m not seeing?
I’d appreciate any thoughts, critiques, or pointers to related work. Even “this is not a useful direction” would be helpful feedback.
