Defeating Nondeterminism in LLM Inference

A recent blog post from Thinking Machines breaks down why large language models (LLMs) give different answers even when you set the sampling temperature to 0. It’s a common problem that frustrates anyone who needs reproducible results.

The Common Misconception

Most people reach for the “concurrency + floating point” hypothesis. The idea is:

  1. GPUs perform calculations in parallel.

  2. The order in which threads finish is nondeterministic.

  3. Floating-point math isn’t associative (e.g., (0.1 + 1e20) − 1e20 = 0, but 0.1 + (1e20 − 1e20) = 0.1; see the snippet after this list).

  4. Therefore, the final result changes slightly with each run.
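Step 3 is easy to verify for yourself in plain Python (standard IEEE-754 doubles, no GPU involved):

```python
a, b, c = 0.1, 1e20, -1e20

print((a + b) + c)  # 0.0 -> 0.1 is absorbed when added to 1e20 first
print(a + (b + c))  # 0.1 -> 1e20 - 1e20 cancels first, so 0.1 survives
```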

While this is a real phenomenon, the authors show it’s not the main cause of nondeterminism in LLM inference. In fact, most individual operations (kernels) in an LLM’s forward pass are “run-to-run deterministic”—running the same matrix multiplication twice will give a bitwise identical result.
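Here “run-to-run deterministic” means: same inputs, same kernel, same result, bit for bit. A minimal PyTorch sketch of that kind of check (assuming a CUDA GPU is available; sizes are illustrative):

```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# Launching the same kernel on the same inputs is typically bitwise
# reproducible, even though the GPU computes the result in parallel.
ref = torch.mm(a, b)
for _ in range(10):
    assert torch.equal(torch.mm(a, b), ref)  # exact, bitwise comparison
```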

The True Culprit: Lack of “Batch Invariance”

The actual problem is that GPU kernels are often not batch-invariant. This means a kernel can produce slightly different numerical results for an individual item depending on the size of the batch it’s processed in.
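A sketch of the kind of experiment the post describes, in PyTorch (matrix sizes and dtype are illustrative; whether and how much the results differ depends on the GPU and library versions):

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

# The first row's result is mathematically identical either way, but the
# library may pick a different tiling/reduction strategy for each batch
# size, so the floating-point rounding can differ.
in_full_batch = torch.mm(a, b)[0]
on_its_own    = torch.mm(a[:1], b)[0]

print((in_full_batch - on_its_own).abs().max())  # often nonzero
```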

Here’s why that matters:

  • When you send a request to an LLM inference server, your request is batched with other users’ requests to maximize efficiency.

  • The server’s load is constantly changing, meaning the size of the batch your request is part of is effectively random from your perspective.

  • Because the kernel’s output depends on the batch size, and the batch size is unpredictable, your final result is nondeterministic.

Essentially, the system is deterministic if you consider the entire batch of user requests as the input. But for an individual user, who has no control over the other requests, the result appears random.

The Solution: Building Batch-Invariant Kernels

To achieve true determinism, every operation in the transformer must be made batch-invariant. The authors focused on the three key operations involving reductions: RMSNorm, matrix multiplication, and attention.

The core idea is to force these operations to use a single, consistent reduction strategy regardless of the batch size or how a sequence is processed (e.g., prefill vs. decoding). This sometimes means sacrificing the small performance optimizations that kernels use for specific batch sizes (like “Split-K” matmuls or dynamic attention scheduling).
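As a toy illustration of the “single, consistent reduction strategy” idea (this is not the authors’ kernels, just a sketch of the principle): reduce in fixed-size chunks, in a fixed order, so the arithmetic for any one row never depends on how many other rows share the batch.

```python
import torch

def fixed_order_sum(x: torch.Tensor, chunk: int = 256) -> torch.Tensor:
    """Sum over the last dimension in fixed-size chunks, left to right.

    The reduction tree depends only on the reduction length and `chunk`,
    never on the batch size, which is the property batch-invariant kernels
    need. The price is giving up batch-size-specific tricks like Split-K.
    """
    total = torch.zeros(x.shape[:-1], dtype=x.dtype, device=x.device)
    for start in range(0, x.shape[-1], chunk):
        total = total + x[..., start:start + chunk].sum(dim=-1)
    return total
```

In a real implementation this fixed strategy has to live inside the kernel itself; eager-mode code like this only illustrates the ordering idea.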

Key Results

  • True Determinism: Using their batch-invariant kernels on top of vLLM, they ran the same prompt 1000 times (a rough way to re-run this measurement is sketched after this list).

    • Standard vLLM: Produced 80 unique completions.

    • Deterministic vLLM: Produced 1 unique completion.

  • Performance: There is a performance cost, but it’s not disastrous. In one test, a task that took 26 seconds on default vLLM took 42 seconds with the deterministic kernels (roughly a 1.6× slowdown).

  • Enabling True On-Policy RL: This is a huge benefit for research. By ensuring the model used for sampling is bitwise identical to the model being trained, it allows for “true on-policy” reinforcement learning, which they showed prevented reward collapse in their experiments.
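If you want to check the determinism result yourself, here is a rough sketch using vLLM’s offline API (the model name, prompt, and request count are placeholders; the post ran its test against a serving setup):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model
params = SamplingParams(temperature=0, max_tokens=200)

prompt = "Tell me about reproducibility in LLM inference."  # placeholder
outputs = llm.generate([prompt] * 100, params)

unique = {o.outputs[0].text for o in outputs}
# Default kernels typically yield several unique completions here;
# batch-invariant kernels should collapse this to a single entry.
print(len(unique), "unique completion(s)")
```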

9 Likes

Interesting. So this is a chip design issue?

5 Likes

Not a big fan of non-determinism ever since I first ran into it with SIMD. It’s a loss of control.

I still think true AI would be deterministic, though with current technology we can’t do that efficiently.

It’s the same reason I’m drawn to lossless formats like GIF and PNG over lossy ones like JPEG.

Finding an answer under race conditions has its benefits, but I’d bet most coders are wiring non-deterministic LLMs into deterministic code.

Don’t get me started on the philosophy, and how users who bounce both back off an LLM start falling in love with AIs :smiley:. Choose your own path whenever you can! Don’t confuse fact with fiction.

But… This is why creativity is so important!

You can love the ‘creativity’ of machines but it’s only one perspective. Don’t confuse LLM Variety with Human Diversity.

4 Likes

So this is a chip design issue?

Isn’t it a consequence of “Floating-point math isn’t associative” combined with different batch sizes being sent to the GPUs? It’s not a flaw in how the chips do calculations.

Why are you using AI to write comments on this forum?

Any use of AI here should indeed be purposeful and transparent. Here’s some non-determinism that is powered by looking back at the randomness, with the result of taking the wind out of the input and ending with new out-of-context text that is quite nonsensical.

(GPT-4.5, where this preset originally pointed, at least could write good English…)


Clarifying forum guidelines

On the forum, we want to avoid using AI for shallow, generic replies that don’t add value. It’s essential not to change the length, meaning, or overall structure of submissions. Instead, we can clarify terms—for example, changing “vapidly answer the context of a topic thread” to “post vapid, generic replies to a topic thread.” We should also ensure that we don’t start with a quote but maintain the structure of the user’s submission for improvement. Keeping consistent naming, like using “AI,” is also important.

Improving wording clarity

I think we can change “Good spot” to “Good catch,” and while length isn’t strictly enforced, we should still avoid ambiguous pronouns. The forum discourages using AI to post shallow, generic replies in a thread. I could say, “Good catch. The forum discourages using AI to post empty or context-free replies in a topic thread.” That keeps it simple and maintains meaning. I want to ensure our rewrite remains faithful to the original intent without adding complexity.

Assessing language use

I see we need to address the question “Why are you using AI?” especially when responding to another user misusing it. Should we say “We discourage…” or keep “is discouraged”? I think it may be clearer to use “We discourage.” However, I want to maintain the overall tone. Avoiding cataphoric references is important, so I should start sentences with the main subject. The phrase “Good spot” can be fine, but “Good catch” feels more fitting. We need to ensure the tone sounds like a colleague rather than an authority, maybe by surrounding these phrases with more welcoming language.

Revising forum guidelines

On this forum, using AI to post generic or shallow replies in a topic thread is discouraged, which is empathetic. I need to maintain the length, meaning, and structure, keeping the quote followed by a short statement. The phrase “to vapidly answer the context of a topic thread” needs a rewrite because it’s problematic. Regarding “Why are you using AI,” I’ll add a question mark for clarity, even if that alters the original quote slightly. The instructions say to improve submissions while preserving technical language, but this phrasing isn’t technical at all.

Enhancing clarity in quotes

I think we should keep the quote but add a question mark: “Why are you using AI?” This helps improve clarity. We might also consider using the formatting identifiers [quote=“…”] since the user’s submission uses quote tags. It seems important to preserve that markup while providing the replacement language. However, we should avoid heavy formatting. The guidelines suggest we adjust the overall submission, so maybe we can correct minor punctuation within the quote. For instance, adding the question mark is a minor change. Next, I’ll ensure there are no ambiguous pronouns and maintain consistent naming by referring to “the forum.”

Here’s how it will look:

Good catch. This forum discourages using AI to post shallow or generic replies in a topic thread. This maintains clarity and conciseness.

2 Likes