GPT-4.5 preview does not appear to be deterministic

Temperature = 0
top_p = 0
seed = 42

Even a run of 10 identical prompts generates as much as 50% variability in long-form output.

Have tried different top_p settings in combo with temp=0, no change.
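
For reference, here’s roughly how I’m measuring this (a minimal sketch, assuming the official openai Python SDK; the model id, prompt and trial count are illustrative):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPT = "Summarise the attached policy document in ~500 words."  # illustrative
N_TRIALS = 10

outputs = []
for _ in range(N_TRIALS):
    resp = client.chat.completions.create(
        model="gpt-4.5-preview",   # assumed model id
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        top_p=0,
        seed=42,
    )
    outputs.append(resp.choices[0].message.content)

# One distinct completion == fully deterministic; more == variability.
distinct = Counter(outputs)
print(f"{len(distinct)} distinct outputs across {N_TRIALS} trials")
```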

Is it possible that part of the optimisation process of serving Orion involved some sort of earlier truncation of possible logits?


UPDATE: … so, changing max_tokens changes this… even though the actual response size stays roughly the same…

Lowering max_tokens to roughly the expected response length makes it deterministic over 25 trials.

Increasing it to 16384 leads to virtually the exact same length of response… but returns to less than 50% determinism…
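
A sketch of the sweep, if anyone wants to reproduce it (same assumptions as above; model id, prompt and token limits are illustrative):

```python
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarise the attached policy document in ~500 words."  # illustrative

# Compare determinism across max_tokens settings, 25 trials each.
for max_toks in (2000, 16384):
    completions = set()
    for _ in range(25):
        resp = client.chat.completions.create(
            model="gpt-4.5-preview",   # assumed model id
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
            top_p=0,
            seed=42,
            max_tokens=max_toks,
        )
        completions.add(resp.choices[0].message.content)
    print(f"max_tokens={max_toks}: {len(completions)} distinct outputs over 25 trials")
```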

…this model is weird…


What o1 Pro thinks of all this:

Below is some insight into why reducing max_tokens might have made your generations “lock in” to a single, repeated response—even when you still have temperature=0, top_p=0, and a fixed seed.

1. What max_tokens Typically Does

• max_tokens: This is simply an upper limit on how many tokens the model is allowed to generate in a single response. It doesn’t directly affect the probability distribution within that generation, but it can alter generation behavior in indirect ways.

Typically, if your prompt+system instructions already lead to a short response (maybe ~100 tokens max) and your max_tokens is set to something large (like 16k), it shouldn’t matter, because the model was never going to reach 16k tokens. The generation would stop well before that.

However, in some implementations or via certain server settings, having a very large max_tokens can nudge the model’s next-token loop into certain expansions or synonyms that it might not pick if it has a more constrained horizon. This effect is usually subtle—but it can lead to slight variations, especially if there are multiple equally likely continuations.

2. Why Reducing max_tokens Might Change Output Variability

  1. Different Internal Heuristics or “Lookahead”

Some model-serving frameworks do more than just “cut off” at max_tokens. In certain implementations, the sampling or beam-search (if used in part internally) might incorporate knowledge of how many tokens remain available. Even with temperature=0, the model’s next-token selection can subtly shift if it “knows” there’s limited space.

  2. Earlier Encounter with Stop Criteria

If your application or prompt has any kind of “stop sequence” or special condition that triggers an early stop—especially if the model is able to see that it’s reaching the limit—it might converge on the same single best path. A large max_tokens could lead it to keep generating (and pick from near-equal probabilities along the way), resulting in subtle variations.

  3. Interaction with Streaming or Partial Outputs

In streaming or chunked generation contexts, the model might “look for” an opportune place to complete a sentence or might break ties differently if it has a large runway of tokens vs. a short runway. This is very implementation dependent—some back-end systems do extra checks to avoid partial or awkward truncation.

  4. Non-Deterministic Ties Get Disambiguated

With a larger maximum, the model can proceed through multiple “tie” states; if each tie breaks in slightly different ways (due to floating-point nondeterminism), you see variations. But if the model sees it has fewer tokens to produce, it may more quickly converge on the single highest-likelihood path. (A toy illustration of this floating-point tie-breaking follows this list.)

  5. Server/Framework Quirks

Sometimes the biggest factor is not the model itself but how the server infrastructure or API “thinks” about generating text. If the server tries to be more efficient (less partial caching, fewer micro-batches, or different synchronization) when the maximum token limit is small, that can reduce the floating-point nondeterminism or concurrency that leads to slight output variations.
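
To make the tie-breaking in point 4 concrete: floating-point addition is not associative, so if a batched or parallel kernel happens to reduce the same logit contributions in a different order between runs, two near-tied tokens can swap places even under greedy decoding. A toy illustration:

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)   # 0.0  (the 0.1 is lost when it is added to 1e20 first)
print(a + (b + c))   # 0.1

# If two tokens' logits differ by less than this kind of rounding error,
# a greedy argmax can pick a different token depending on how a parallel
# kernel ordered its reductions on that particular run.
```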

3. Observing the “Stop” Finish Reason

In your logs, you see:

System Fingerprint: None

Finish Reason: stop

This suggests the model stopped because it hit a stop condition—most likely an explicit stop rule such as a newline sequence or user-specified stop token. When the model is quite certain it’s basically “done” or has matched the prompt style, it will produce identical completions over and over, especially at temperature=0.

If you had more runway (max_tokens=16k), there might be some stage at which it continues or picks a slightly different final phrase or synonyms. But the combined presence of a short horizon (max_tokens=2000) plus an enforced stop condition is effectively bounding the completion into a single “canonical” short block.
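
Both fields are exposed on the API response, so this is easy to check per run (a sketch, assuming the openai Python SDK; model id and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.5-preview",   # assumed model id
    messages=[{"role": "user", "content": "Summarise the policy document."}],
    temperature=0,
    top_p=0,
    seed=42,
    max_tokens=2000,
)

print("Finish reason:     ", resp.choices[0].finish_reason)   # "stop" vs "length"
print("System fingerprint:", resp.system_fingerprint)         # may be None
```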

4. Why Exactly the Same Output (Versus Minor Differences)

  1. Temperature=0 and top_p=0

At “hard deterministic” sampling, the model always picks the highest-probability next token. If the distribution at each step is not tied or extremely close between tokens, you will get a single path with zero variation.

• The subtle variations you saw earlier might have happened only if some tokens were effectively tied. With the new environment or the shorter max token setting, any near-ties might be breaking in the same deterministic order.

  2. Prompt Repetition

Because each request is probably reusing the same prompt, the model sees no difference in context from one call to the next. No new tokens or user input that might shift the probabilities. So it just picks the same top-likelihood tokens in the same order every single time.

  3. Model Implementation Handling Ties

If the environment is pinned to CPU (or has some forced determinism in the GPU code) or is just consistently picking the same tie-break, you’ll see identical text. The large context limit might previously have triggered some micro-floating-point differences that are not showing up under the new config.

In a Nutshell

• Shorter max_tokens can lead to more “predictable” (identical) output, because the model “knows” it does not have the same runway to expand or vary.

• If any near-ties in the next token distribution do occur, the new environment or framework conditions might be resolving them consistently rather than divergently.

• With temperature=0 (and top_p=0), the model is already likely to pick the single highest-probability path. Any minor hardware-based nondeterminism that could have caused random synonyms at 16k tokens might be absent when the generation length is heavily constrained.

Hence, you get identical responses—no synonyms or tangents appear. The model just outputs the same standard block each time and then hits a “stop” reason.

In general, large LLMs are not deterministic.

GPT-4.5 is not weird in this respect; it behaves much like prior models. This has come up regularly on the forum with other models.


Hi Merefield - I’m very familiar with the nature of transformers and their mechanics.

Obviously there are architectural limits on determinism, but in point of fact, the transformer forward pass is indeed deterministic, and the idea that it is inherently stochastic is a misconception.

Stochasticity is introduced only during the decoding step. If greedy sampling is applied and a seed is fixed, the only remaining sources of non-determinism come down to race conditions on the token logits.
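
In toy terms (a sketch, not any particular inference framework): greedy decoding is just a repeated argmax over the forward pass, so there is nowhere for randomness to enter unless the logits themselves wobble between runs.

```python
import numpy as np

def greedy_decode(forward_pass, prompt_ids, max_new_tokens, eos_id):
    """Toy greedy decoding loop. `forward_pass` stands in for any pure
    function from a token sequence to next-token logits. Given identical
    inputs and bit-identical logits, this loop is fully deterministic."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward_pass(ids)
        next_id = int(np.argmax(logits))   # fixed tie-break: lowest index wins
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```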

The behaviour I am seeing is FAR outside of normal predictability and I’m sure it will be verified by the usual perplexity/entropy monitors like Galileo soon.

It has massive implications for enterprise use of this model.

That’s not true.

Mixture-of-experts architectures and the associated routing algorithms in production systems will despatch queries to different parts of the network for efficiency and parallelism (i.e. improved throughput), resulting in a de facto lack of determinism.

Why?

And if determinism were provided, would enterprise customers be prepared to pay the hugely increased cost per transaction (at least 2-4x)?

This behavior doesn’t surprise me as stated.

Is this behavior consistent with comparable models like 4o, or is it something new (with tight test cases)?

General question too. Is the behavior you are isolating divergent in the de facto uncapped response scenario …

The determinism thing gets old, so let me be really clear:

  • There is literally no requirement for an LLM to be stochastic, and it can absolutely be deterministic, subject to:
    • Decisions made architecturally (which in your case would be not simply MoE, because there are plenty of ways to do that deterministically also, but more your second point around efficiency and parallelism, which absolutely DOES affect determinism - and which is ‘de facto’ architectural)
    • Race conditions at inference
    • Literal cosmic rays
    • Whatever else you add to the poor thing to make it sample weird

Having said ALL of that, obviously yes, we have been in a situation for a long time now where ALL of the above are in effect to some degree most of the time, and so true determinism is a rare thing.

HOWEVER.

A truly epic number of enterprise deployments of these models default to temperature=0, often along with changes to top_p, in order to minimise variation. Even with these in place, variation still occurs.

But not on the order I am seeing with this model. I’m not kidding when I say that even on maximum-determinism settings, I’m seeing 50% variation between results unless I constrain the token count. And these variations are substantive.

In enterprise deployments (not talking about a product here, talking more about those using one or rolling their own), this will make the model very difficult to become comfortable with.

From a technical perspective, what I am seeing would be consistent with the model containing an order of magnitude more data points in the latent space and, as a result, potentially requiring much finer precision on the logits to separate options for the token pick. If Orion is as big as they are saying, it’s possible that the softmax function just won’t produce as many ‘clear winners’ for tokens as often, because the number of geometrically proximal tokens is much higher.
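
To make the ‘clear winner’ idea concrete, here is a toy way to quantify it: the probability margin between the top two tokens after softmax. The logit values are invented; the point is that a denser latent space would show up as many more steps looking like the second case, where any numerical noise can flip the greedy pick.

```python
import numpy as np

def top2_margin(logits):
    """Probability gap between the best and second-best token under softmax."""
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())
    p /= p.sum()
    top2 = np.sort(p)[-2:]
    return float(top2[1] - top2[0])

print(top2_margin([5.0, 2.0, 1.0]))     # large margin: greedy pick is robust
print(top2_margin([5.0, 4.999, 1.0]))   # tiny margin: a near-tie, easily flipped
```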

Very much NOT comparable with other models.

On max-determinism settings on 4o, I can expect quant evals to routinely produce identical outcomes on the order of 90-95% of the time. This doesn’t require me to modify total token count either.

Orion never seems to get anywhere near that repeatable unless I lock the total token count down significantly … which is in itself very weird behaviour because token budget has never directly been a factor in perplexity…

I’m curious about one thing:

I get why determinism might be useful if your inputs are very predictable, so you can eval on a very representative and perhaps even a complete set and judge the model’s performance and feel confident things are repeatable.

But why would you want to use an LLM if your inputs are very predictable and your outputs are deterministic? Why not just use an old-school lookup?

Another question:

Let’s say we dial down the determinism from perfect and start to introduce a bit more variety in the output: what is the first issue you are going to encounter in your use case? Can you give a specific example?


This is a reasonable question and there are actually many answers to it, but here’s the simple one:

LLMs in an enterprise context are more useful as The Regex God Would Write than they are as a creative output system.

When you go deep on this, there are an enormous number of metrics and maths from the dawn of transformer models - like perplexity and surprise - which are about not just identifying determinism, but identifying the degree to which token selection is itself uncertain or borderline.

These in turn are directly analogous to the study of the tolerance of a given prompt on a given model at temperature 0 to remain within predictable output parameters for a VAST ARRAY of input permutations.
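
Both metrics are cheap to monitor per call, since the API can return token logprobs. A sketch (assuming the openai Python SDK; model id and prompt are illustrative):

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.5-preview",   # assumed model id
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
    temperature=0,
    logprobs=True,
)

token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]

# Per-token surprise is -logprob; perplexity is exp of the mean surprise.
mean_surprise = -sum(token_logprobs) / len(token_logprobs)
print("perplexity:", math.exp(mean_surprise))
```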

This becomes especially true if you then either finetune the model for output or use structured outputs (or logit bias) thereby dramatically constraining the scope of probable token outputs.

Every single time you use gpt-4o-mini as a judge, a classifier or a JSON command router, you are relying on this behaviour.

Likewise, every time you build out agentic customer service systems, you are relying on very tight patterns of response even across a very large array of valid input permutations.
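
For example, a minimal classifier/router call of the kind described above might look like this (model choice, labels and prompt are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": 'Classify the user message. Reply with JSON: {"route": "billing" | "technical" | "other"}.'},
        {"role": "user", "content": "My invoice is wrong for February."},
    ],
    temperature=0,
    response_format={"type": "json_object"},   # constrain the output to valid JSON
)

route = json.loads(resp.choices[0].message.content)["route"]
print(route)   # downstream logic relies on this being stable across identical inputs
```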

Hope that helps!