GPT-4.5 preview does not appear to be deterministic

Temperature = 0
top_p = 0
seed = 42

Even a run of 10 identical prompts generates as much as 50% variability in long-form output.

Have tried different top_p settings in combo with temp=0, no change.
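
For reference, here’s roughly how I’m measuring this (a minimal sketch, assuming the official openai Python SDK; the model id, prompt and trial count are illustrative):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPT = "Summarise the attached policy document in ~500 words."  # illustrative
N_TRIALS = 10

outputs = []
for _ in range(N_TRIALS):
    resp = client.chat.completions.create(
        model="gpt-4.5-preview",   # assumed model id
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        top_p=0,
        seed=42,
    )
    outputs.append(resp.choices[0].message.content)

# One distinct completion == fully deterministic; more == variability.
distinct = Counter(outputs)
print(f"{len(distinct)} distinct outputs across {N_TRIALS} trials")
```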

Is it possible that part of the optimisation process of serving Orion involved some sort of earlier truncation of possible logits?


UPDATE: … so, changing max_tokens changes this… even though the actual response size stays roughly the same…

Lowering max_tokens to roughly the expected response length makes it deterministic over 25 trials.

Increasing it to 16384 leads to virtually the exact same length of response… but returns to less than 50% determinism…
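
A sketch of the sweep, if anyone wants to reproduce it (same assumptions as above; model id, prompt and token limits are illustrative):

```python
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarise the attached policy document in ~500 words."  # illustrative

# Compare determinism across max_tokens settings, 25 trials each.
for max_toks in (2000, 16384):
    completions = set()
    for _ in range(25):
        resp = client.chat.completions.create(
            model="gpt-4.5-preview",   # assumed model id
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
            top_p=0,
            seed=42,
            max_tokens=max_toks,
        )
        completions.add(resp.choices[0].message.content)
    print(f"max_tokens={max_toks}: {len(completions)} distinct outputs over 25 trials")
```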

…this model is weird…


What o1 Pro thinks of all this:

Below is some insight into why reducing max_tokens might have made your generations “lock in” to a single, repeated response—even when you still have temperature=0, top_p=0, and a fixed seed.

1. What max_tokens Typically Does

• max_tokens: This is simply an upper limit on how many tokens the model is allowed to generate in a single response. It doesn’t directly affect the probability distribution within that generation, but it can alter generation behavior in indirect ways.

Typically, if your prompt+system instructions already lead to a short response (maybe ~100 tokens max) and your max_tokens is set to something large (like 16k), it shouldn’t matter, because the model was never going to reach 16k tokens. The generation would stop well before that.

However, in some implementations or via certain server settings, having a very large max_tokens can nudge the model’s next-token loop into certain expansions or synonyms that it might not pick if it has a more constrained horizon. This effect is usually subtle—but it can lead to slight variations, especially if there are multiple equally likely continuations.

2. Why Reducing max_tokens Might Change Output Variability

  1. Different Internal Heuristics or “Lookahead”

Some model-serving frameworks do more than just “cut off” at max_tokens. In certain implementations, the sampling or beam-search (if used in part internally) might incorporate knowledge of how many tokens remain available. Even with temperature=0, the model’s next-token selection can subtly shift if it “knows” there’s limited space.

  2. Earlier Encounter with Stop Criteria

If your application or prompt has any kind of “stop sequence” or special condition that triggers an early stop—especially if the model is able to see that it’s reaching the limit—it might converge on the same single best path. A large max_tokens could lead it to keep generating (and pick from near-equal probabilities along the way), resulting in subtle variations.

  3. Interaction with Streaming or Partial Outputs

In streaming or chunked generation contexts, the model might “look for” an opportune place to complete a sentence or might break ties differently if it has a large runway of tokens vs. a short runway. This is very implementation dependent—some back-end systems do extra checks to avoid partial or awkward truncation.

  4. Non-Deterministic Ties Get Disambiguated

With a larger maximum, the model can proceed through multiple “tie” states; if each tie breaks in slightly different ways (due to floating-point nondeterminism), you see variations. But if the model sees it has fewer tokens to produce, it may more quickly converge on the single highest-likelihood path. (A toy illustration of this floating-point tie-breaking follows this list.)

  5. Server/Framework Quirks

Sometimes the biggest factor is not the model itself but how the server infrastructure or API “thinks” about generating text. If the server tries to be more efficient (less partial caching, fewer micro-batches, or different synchronization) when the maximum token limit is small, that can reduce the floating-point nondeterminism or concurrency that leads to slight output variations.
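
To make the tie-breaking in point 4 concrete: floating-point addition is not associative, so if a batched or parallel kernel happens to reduce the same logit contributions in a different order between runs, two near-tied tokens can swap places even under greedy decoding. A toy illustration:

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)   # 0.0  (the 0.1 is lost when it is added to 1e20 first)
print(a + (b + c))   # 0.1

# If two tokens' logits differ by less than this kind of rounding error,
# a greedy argmax can pick a different token depending on how a parallel
# kernel ordered its reductions on that particular run.
```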

3. Observing the “Stop” Finish Reason

In your logs, you see:

System Fingerprint: None

Finish Reason: stop

This suggests the model stopped because it hit a stop condition—most likely an explicit stop rule such as a newline sequence or user-specified stop token. When the model is quite certain it’s basically “done” or has matched the prompt style, it will produce identical completions over and over, especially at temperature=0.

If you had more runway (max_tokens=16k), there might be some stage at which it continues or picks a slightly different final phrase or synonyms. But the combined presence of a short horizon (max_tokens=2000) plus an enforced stop condition is effectively bounding the completion into a single “canonical” short block.
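
Both fields are exposed on the API response, so this is easy to check per run (a sketch, assuming the openai Python SDK; model id and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.5-preview",   # assumed model id
    messages=[{"role": "user", "content": "Summarise the policy document."}],
    temperature=0,
    top_p=0,
    seed=42,
    max_tokens=2000,
)

print("Finish reason:     ", resp.choices[0].finish_reason)   # "stop" vs "length"
print("System fingerprint:", resp.system_fingerprint)         # may be None
```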

4. Why Exactly the Same Output (Versus Minor Differences)

  1. Temperature=0 and top_p=0

At “hard deterministic” sampling, the model always picks the highest-probability next token. If the distribution at each step is not tied or extremely close between tokens, you will get a single path with zero variation.

• The subtle variations you saw earlier might have happened only if some tokens were effectively tied. With the new environment or the shorter max token setting, any near-ties might be breaking in the same deterministic order.

  2. Prompt Repetition

Because each request is probably reusing the same prompt, the model sees no difference in context from one call to the next. No new tokens or user input that might shift the probabilities. So it just picks the same top-likelihood tokens in the same order every single time.

  3. Model Implementation Handling Ties

If the environment is pinned to CPU (or has some forced determinism in the GPU code) or is just consistently picking the same tie-break, you’ll see identical text. The large context limit might previously have triggered some micro-floating-point differences that are not showing up under the new config.

In a Nutshell

• Shorter max_tokens can lead to more “predictable” (identical) output, because the model “knows” it does not have the same runway to expand or vary.

• If any near-ties in the next token distribution do occur, the new environment or framework conditions might be resolving them consistently rather than divergently.

• With temperature=0 (and top_p=0), the model is already likely to pick the single highest-probability path. Any minor hardware-based nondeterminism that could have caused random synonyms at 16k tokens might be absent when the generation length is heavily constrained.

Hence, you get identical responses—no synonyms or tangents appear. The model just outputs the same standard block each time and then hits a “stop” reason.

In general, large LLMs are not deterministic.

GPT-4.5 is not weird in this respect; it behaves much like prior models. This has come up regularly on the forum with other models.


Hi Merefield - I’m very familiar with the nature of transformers and their mechanics.

Obviously there are architectural limits on determinism, but in point of fact, the transformer forward pass is indeed deterministic, and the idea that it is inherently stochastic is a misconception.

Stochasticity is introduced only during the decoding step. If greedy sampling is applied and a seed is fixed, the only remaining sources of non-determinism come down to race conditions on the token logits.
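
In toy terms (a sketch, not any particular inference framework): greedy decoding is just a repeated argmax over the forward pass, so there is nowhere for randomness to enter unless the logits themselves wobble between runs.

```python
import numpy as np

def greedy_decode(forward_pass, prompt_ids, max_new_tokens, eos_id):
    """Toy greedy decoding loop. `forward_pass` stands in for any pure
    function from a token sequence to next-token logits. Given identical
    inputs and bit-identical logits, this loop is fully deterministic."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward_pass(ids)
        next_id = int(np.argmax(logits))   # fixed tie-break: lowest index wins
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```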

The behaviour I am seeing is FAR outside of normal predictability and I’m sure it will be verified by the usual perplexity/entropy monitors like Galileo soon.

It has massive implications for enterprise use of this model.

That’s not true.

Mixture-of-experts architectures and the associated routing algorithms in production systems will despatch queries to different parts of the network for efficiency and parallelism (i.e. improved throughput), resulting in a de facto lack of determinism.

Why?

And if determinism were provided, would enterprise customers be prepared to pay the hugely increased cost per transaction (at least 2-4x)?

This behavior doesn’t surprise me as stated.

Is this behavior consistent with comparable models like 4o, or is it something new (with tight test cases)?

General question too. Is the behavior you are isolating divergent in the de facto uncapped response scenario …

The determinism thing gets old, so let me be really clear:

  • There is literally no requirement for an LLM to be stochastic, and it can absolutely be deterministic, subject to:
    • Decisions made architecturally (which in your case would be not simply MoE, because there are plenty of ways to do that deterministically also, but more your second point around efficiency and parallelism, which absolutely DOES affect determinism - and which is ‘de facto’ architectural)
    • Race conditions at inference
    • Literal cosmic rays
    • Whatever else you add to the poor thing to make it sample weird

Having said ALL of that, obviously yes, we have been in a situation for a long time now where ALL of the above are in effect to some degree most of the time, and so true determinism is a rare thing.

HOWEVER.

A truly epic number of enterprise deployments of these models default to temperature=0, often along with changes to top_p, in order to minimise variation. Even with these in place, variation still occurs.

But not on the order I am seeing with this model. I’m not kidding when I say that even on maximum-determinism settings, I’m seeing 50% variation between results unless I constrain the token count. And these variations are substantive.

In enterprise deployments (not talking about a product here, talking more about those using one or rolling their own), this will make the model very difficult to become comfortable with.

From a technical perspective, what I am seeing would be consistent with the model containing an order of magnitude more data points in the latent space and, as a result, potentially requiring much finer precision on the logits to separate options for the token pick. If Orion is as big as they are saying, it’s possible that the softmax function just won’t produce as many ‘clear winners’ for tokens as often, because the number of geometrically proximal tokens is much higher.
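
To make the ‘clear winner’ idea concrete, here is a toy way to quantify it: the probability margin between the top two tokens after softmax. The logit values are invented; the point is that a denser latent space would show up as many more steps looking like the second case, where any numerical noise can flip the greedy pick.

```python
import numpy as np

def top2_margin(logits):
    """Probability gap between the best and second-best token under softmax."""
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())
    p /= p.sum()
    top2 = np.sort(p)[-2:]
    return float(top2[1] - top2[0])

print(top2_margin([5.0, 2.0, 1.0]))     # large margin: greedy pick is robust
print(top2_margin([5.0, 4.999, 1.0]))   # tiny margin: a near-tie, easily flipped
```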

Very much NOT comparable with other models.

On max-determinism settings on 4o, I can expect quant evals to routinely produce identical outcomes on the order of 90-95% of the time. This doesn’t require me to modify total token count either.

Orion never seems to get anywhere near that repeatable unless I lock the total token count down significantly … which is in itself very weird behaviour because token budget has never directly been a factor in perplexity…

I’m curious about one thing:

I get why determinism might be useful if your inputs are very predictable, so you can eval on a very representative and perhaps even a complete set and judge the model’s performance and feel confident things are repeatable.

But why would you want to use an LLM if your inputs are very predictable and your outputs are deterministic? Why not just use an old-school lookup?

Another question:

Let’s say we dial down the determinism from perfect and start to introduce a bit more variety in the output: what is the first issue you are going to encounter in your use case? Can you give a specific example?


This is a reasonable question and there are actually many answers to it, but here’s the simple one:

LLMs in an enterprise context are more useful as The Regex God Would Write than they are as a creative output system.

When you go deep on this, there are an enormous number of metrics and maths from the dawn of transformer models - like perplexity and surprise - which are about not just identifying determinism, but identifying the degree to which token selection is itself uncertain or borderline.

These in turn are directly analogous to the study of the tolerance of a given prompt on a given model at temperature 0 to remain within predictable output parameters for a VAST ARRAY of input permutations.
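
Both metrics are cheap to monitor per call, since the API can return token logprobs. A sketch (assuming the openai Python SDK; model id and prompt are illustrative):

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.5-preview",   # assumed model id
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
    temperature=0,
    logprobs=True,
)

token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]

# Per-token surprise is -logprob; perplexity is exp of the mean surprise.
mean_surprise = -sum(token_logprobs) / len(token_logprobs)
print("perplexity:", math.exp(mean_surprise))
```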

This becomes especially true if you then either finetune the model for output or use structured outputs (or logit bias) thereby dramatically constraining the scope of probable token outputs.

Every single time you use gpt-4o-mini as a judge, a classifier or a JSON command router, you are relying on this behaviour.

Likewise, every time you build out agentic customer service systems, you are relying on very tight patterns of response even across a very large array of valid input permutations.
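
For example, a minimal classifier/router call of the kind described above might look like this (model choice, labels and prompt are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": 'Classify the user message. Reply with JSON: {"route": "billing" | "technical" | "other"}.'},
        {"role": "user", "content": "My invoice is wrong for February."},
    ],
    temperature=0,
    response_format={"type": "json_object"},   # constrain the output to valid JSON
)

route = json.loads(resp.choices[0].message.content)["route"]
print(route)   # downstream logic relies on this being stable across identical inputs
```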

Hope that helps!