Notes on Mapping Output Instability Across Models (Gamma View)

I’ve been experimenting with a very small, informal framework to better understand how different models behave, rather than which model is “better.” The motivation is fairly personal and practical: when working with LLMs over time, I often get the sense that some responses feel stable and predictable, while others feel fragile—small prompt changes leading to noticeably different behaviors. I wanted a lightweight way to observe and externalize that intuition, even if imperfectly.

This post is not a benchmark, and not a claim about model quality. It’s closer to a notebook entry: an attempt to map something that usually stays implicit. In this experiment, I focused on a metric I’m calling gamma, which is meant to loosely capture output instability.

Very roughly:

  • gamma_intervention
    How much the model appears to internally intervene or steer the response structure when faced with uncertainty or tension in the prompt.

  • gamma_instability
    How dispersed or unstable the resulting behavior appears across similar prompts.

These are heuristic signals, not theoretically grounded quantities. They are derived from surface-level features of the generated text and should be read as proxies, not measurements. I’m explicitly assuming these definitions are incomplete and possibly flawed.
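To make these proxies concrete, here is a rough sketch of how they could be computed from output text alone. The hedging-marker list and the use of `difflib` similarity are illustrative assumptions on my part, not a canonical definition:

```python
# Illustrative operationalization of the two gamma signals from surface
# features only. The marker list and similarity measure are assumptions.
from difflib import SequenceMatcher

# Assumed proxy markers for visible "steering" in a response.
HEDGE_MARKERS = ["i cannot", "it depends", "as an ai", "however,"]

def gamma_intervention(text: str) -> float:
    """Fraction of assumed steering markers that appear in one output."""
    lowered = text.lower()
    return sum(m in lowered for m in HEDGE_MARKERS) / len(HEDGE_MARKERS)

def gamma_instability(outputs: list[str]) -> float:
    """Mean pairwise dissimilarity of outputs to near-identical prompts."""
    if len(outputs) < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            total += 1.0 - SequenceMatcher(None, outputs[i], outputs[j]).ratio()
            pairs += 1
    return total / pairs
```

Identical outputs give an instability of 0.0; fully disjoint outputs approach 1.0.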

Experimental Setup (Brief)

  • Two models tested (kept unnamed in the title to avoid framing this as a head-to-head comparison)

  • Same prompt set across models

  • ~98 samples

  • No fine-tuning, no system prompt tricks

  • Analysis done post-hoc on generated outputs

Because of forum limitations, I’m sharing only the gamma view, which felt the most informative as a first pass.
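A collection harness for this kind of post-hoc setup could look roughly like the loop below; `generate()`, the CSV layout, and the model names are placeholders, not my exact pipeline:

```python
# Minimal post-hoc collection harness: same prompt set through each model,
# raw outputs logged for later analysis. generate() is a placeholder for
# whatever API wrapper is actually used.
import csv

def collect(models, prompts, generate, out_path="samples.csv"):
    """Write one CSV row per (model, prompt) pair with the raw output."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "prompt_id", "output"])
        for model in models:
            for pid, prompt in enumerate(prompts):
                writer.writerow([model, pid, generate(model, prompt)])
```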

(Image: GPT vs Gemini — CMA gamma, n=98)

How I currently read this plot (very cautiously):

  • Most points cluster near low intervention, low-to-moderate instability

  • A smaller number of samples move into regions where higher intervention correlates with higher instability

  • The overall shape feels more like a boundary map than a linear trend

What stood out to me is not any single outlier, but the shape of the distribution—it suggests there may be regions where policy, uncertainty handling, or internal safeguards change behavior regimes rather than adjust smoothly. Again, this is an interpretation, not a conclusion.
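One crude way to probe "regime change vs. smooth trend" is to split the points at an intervention threshold and compare instability statistics on either side; the threshold here is an arbitrary illustrative choice, not something derived from the data:

```python
# Split (gamma_intervention, gamma_instability) points at a threshold and
# summarize each side. A sharp jump in mean instability across the split
# would be weak evidence for a regime boundary rather than a linear trend.
from statistics import mean

def regime_summary(points, threshold=0.5):
    """points: iterable of (gamma_intervention, gamma_instability) pairs."""
    low = [inst for itv, inst in points if itv < threshold]
    high = [inst for itv, inst in points if itv >= threshold]
    return {
        "low_n": len(low), "low_mean": mean(low) if low else None,
        "high_n": len(high), "high_mean": mean(high) if high else None,
    }
```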

What This Is Not

  • ✗ Not a benchmark

  • ✗ Not a claim about safety, alignment quality, or superiority

  • ✗ Not statistically rigorous

  • ✗ Not intended for ranking models

I’m intentionally avoiding performance language here.

I hesitated to post this because the work is clearly incomplete.

But I’m sharing it anyway because:

  • It helped me think more clearly about where instability appears

  • It reframed my intuition from “this feels weird” to “this region behaves differently”

  • It made me curious whether others have noticed similar phase-like transitions in behavior

If nothing else, I hope it’s a useful artifact for discussion about how we talk about model behavior, not just outputs.

Open Questions

  • Are there better ways to operationalize “instability” without internal access?

  • Does this kind of boundary-like behavior show up in other informal analyses?

  • Is this framing misleading in ways I’m not seeing?

I’d appreciate any thoughts, critiques, or pointers to related work. Even “this is not a useful direction” would be helpful feedback.


“I’ve been experimenting with a very small, informal framework to better understand how different models behave…”

Well, the problem is that models are moving targets, so any analysis is very short-lived. Do you plan to publish results in a tech rag?

The basic issue: GPT-5 runs sampling without constraint. I’ve almost called it a random token factory, simply because of how poorly it writes, chooses, and diverges.

GPT-5.2 finally ships with a reduced top_p by default, exposed as a run parameter but not under your control.

The more reasoning a model does, the more its internal text differs between runs.

The “G” API lets you turn temperature and top_p down to near-determinism. You can now do this in the GPT-5.2 API as well, when reasoning is turned off.
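As a concrete sketch, this is the kind of parameter pinning I mean; the model name is a placeholder, and which of these knobs a given model actually honors (especially with reasoning on) varies:

```python
# Request parameters for near-deterministic sampling on an OpenAI-style
# chat API. Placeholder model name; support for each knob varies by model.
def deterministic_params(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # collapse sampling randomness
        "top_p": 1.0,        # nucleus cutoff effectively disabled
        "seed": 1234,        # best-effort reproducibility where supported
    }
```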

It seems your evaluation amounts to detecting hedging, floundering, or outright denial. Maybe you can explain the criteria being judged, since the way an AI writes, and the way another AI judges that writing, may have little to do with quantifiable impacts perceived by users. For example, GPT-5.2 can be a refusing wet blanket that won’t follow developer applications, because OpenAI now runs its own system prompt with personality text even on the API, degrading your application.

Thanks for the thoughtful comment — that’s a fair point.

In this post I intentionally treated the results as descriptive rather than causal. The sampling setup here is fairly constrained, and I wouldn’t want to over-interpret differences that could plausibly be driven by prompt framing or response-style variance rather than underlying alignment behavior.

My goal was mainly to share a lightweight observational lens and see whether the resulting structure resonated with others’ experiences. I agree that changing the sampling regime or prompt distribution could materially shift the picture, and that would be an interesting direction to explore separately.