Why does the answer vary for the same question asked multiple times?

I am trying to understand this and can't get to the bottom of it. If LLMs are a probability distribution over words, why does the probability of the generated words change for the same sequence of words (the question)? The probability of a word, given the words that came before it, should stay the same unless we add more words to the vocabulary. I think I am missing a concept somewhere. I'm still researching and reading, but I thought someone in the community might be able to guide me.


If you were to actually examine the logits (the logprobs after softmax), you would have seen on GPT-3 models that the results were always the same. Something about the optimization of later OpenAI models, or the hardware they run on, produces some variance in the output values between runs, on the order of a fraction of a percent among the top probabilities.

Perhaps what you are wondering, though, is why you get significantly different responses each time.

That is due to token sampling.

The result of language inference is a certainty score assigned to each token in the model's language dictionary (token encoder). One could simply pick the top result for every token that is generated. However, it was discovered that such an output actually isn't very natural or human.

Instead, the total scores are combined into a normalized probability distribution, where the sum of all certainties = 1.0, or 100%. Imagine a roulette wheel where the slot for “The” is wide because it is well predicted for a generation, while the token “zzarella” is a poor way to start a sentence, and gets an infinitesimal sliver of chance.
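The roulette-wheel picture above can be sketched in a few lines. The vocabulary and logit values below are made-up illustrative numbers, not from any real model:

```python
import numpy as np

# Hypothetical logits for a tiny 4-token vocabulary (illustrative values only).
vocab = ["The", "A", "In", "zzarella"]
logits = np.array([4.0, 3.2, 2.5, -6.0])

# Softmax turns raw scores into a probability distribution summing to 1.0.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sampling: each generation spins the "roulette wheel" once.
rng = np.random.default_rng()
token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(4))), "->", token)
```

Run it a few times: "The" wins most spins because its slot is wide, while "zzarella" almost never appears.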

Thus in any trial, words appear in direct relation to the model's predicted likelihood at that position.

The direct correlation of certainty to probability can be altered with the sampling parameters top_p and temperature.

Top-p is applied first. When it is set below 1.0, the least probable tokens in the tail of the distribution are eliminated; a value of 0.9 keeps only the tokens that occupy the top 90% of the probability mass.
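As a sketch of that filtering step (the distribution is a made-up example, and real implementations may break the cutoff tie differently):

```python
import numpy as np

# Hypothetical next-token distribution (illustrative values, sums to 1.0).
probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])

def top_p_filter(p, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p,
    then renormalize so the survivors again sum to 1.0."""
    order = np.argsort(p)[::-1]               # most probable first
    cum = np.cumsum(p[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # include the token crossing top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(p)
    filtered[keep] = p[keep]
    return filtered / filtered.sum()

print(top_p_filter(probs, 0.9))  # the two tail tokens are zeroed out
```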

Temperature is then applied as a weighting: reducing the value concentrates mass on the most likely tokens, while a high value makes the probabilities more equal.
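Concretely, temperature divides the logits before the softmax. A minimal sketch, with made-up logit values:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature before softmax: T < 1 sharpens the
    distribution toward the top token, T > 1 flattens it toward uniform."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [4.0, 3.0, 1.0]              # hypothetical scores
print(softmax_with_temperature(logits, 0.5))   # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))   # flatter: closer to uniform
```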

In this manner, you can get a creative-sounding but not robotic AI, and with your own use of the sampling parameters, you can reduce some unlikely choices.


The probability doesn’t change, the sampling does.

If you take a six-sided die and roll it ten times, you’ll get one sequence. If you roll it ten more times, you’ll get another sequence.

The probabilities don’t change, you’re just pulling different samples.


Can these models not provide a way to eliminate sampling for some use cases? I thought temperature and seed would do that, but it does not help. The answers are not the same on a different roll of the dice :(

A top_p parameter of 0.000001 is as close as you will get to deterministic, repeatable output.

However, because of the aforementioned slight differences in token logprob values between runs on newer models, the top token can still switch rank with a close second-place choice, particularly in ambiguous language positions.

A single token difference can send the generation on a new path.

Seed also becomes unreliable: the seed fixes a random point, such as 55.55% of the way through the total probability mass, but on a second trial that point may be occupied by a different token.
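To see why a fixed seed isn't enough, note that the seed only pins the uniform draw, not the distribution it lands in. In this sketch (hypothetical probabilities with ~0.1% jitter between runs), the same draw falls on different sides of a token boundary:

```python
import numpy as np

def sample_from_cdf(probs, u):
    """Inverse-CDF sampling: return the index of the first token whose
    cumulative probability exceeds the uniform draw u."""
    return int(np.searchsorted(np.cumsum(probs), u))

u = 0.5555  # the seeded uniform draw: identical on both runs

# Hypothetical run-to-run logprob jitter shifts the mass slightly.
run1 = np.array([0.5550, 0.4450])
run2 = np.array([0.5560, 0.4440])

print(sample_from_cdf(run1, u), sample_from_cdf(run2, u))  # 1 0
```

The boundary between the two tokens moved by 0.001, which was enough to change which token the seeded point selects.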

Context matters: LLMs condition on the entire sequence, not just the previous word. The meaning of a word changes based on the words before it (e.g., "play" vs. "playwright").

Does temperature=0 also give me a similar level of determinism to top_p=0.000001? Or does top_p have an advantage over temperature?

Temperature=0: meant to pick the highest-probability token, but there can be ties in probabilities, which can lead to different outputs for the same prompt even at temperature=0.
top_p=0.000001: forces the model to consider only the single most probable token, achieving near-determinism.
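The tie problem can be sketched like this: when two tokens are essentially tied, greedy decoding has no principled winner, so numeric noise smaller than rounding can flip the pick between runs. The probabilities and jitter below are contrived for illustration:

```python
import numpy as np

# Two tokens with near-identical probability: greedy decoding (argmax)
# picks one, but tiny run-to-run noise can reverse the ordering.
probs = np.array([0.4999999, 0.5000001, 0.0])

greedy = int(np.argmax(probs))                  # index 1 wins this run
jitter = np.array([3e-7, -3e-7, 0.0])           # noise wider than the gap
greedy_next_run = int(np.argmax(probs + jitter))  # index 0 wins instead
print(greedy, greedy_next_run)
```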

Temperature actually can put some distance between the softmax results of two very similar tokens when set low. Back when GPT-3 could return reproducible results, I identified an input where the next token could give identical logprobs; trials at very low temperature values (like 1e-9) could cause one token to be consistently favored, meaning there was more internal precision than the API returned.

The drawback of temperature=0 is that it would mean dividing the logits by zero, so the API substitutes something more like 0.01, or the setting is just completely broken in Assistants or gpt-4o. That means that very poor tokens still have a slight chance of appearing.