When querying several time the /completions API with a temperature of 0, I still observe some differences in the responses. Those differences are usually subtle but can be huge for more complicated prompts. See example below for reproduction.

I interpret the temperature is the temperature of the softmax layer used for the sampling (the last layer of the transformer), and using a temperature of 0 will basically make it an argmax: the next token chosen is the one with the highest score. And so I expect it to be deterministic.

But maybe it’s not exactly the meaning of temperature for this API?

This is a very interesting question that has been around for some time. In my view, the most comprehensive answer was given here: A question on determinism

Even though you are right in your hypothesis @sam_nabla, it doesn’t hold empirically. Even with a greedy decoding strategy, small discrepancies regarding floating point operations lead to divergent generations. In simpler terms: when the top-two tokens have very similar log-probs, there’s a non-zero probability of choosing the least probable one due to the finite number of digits that you’re using for multiplying probs and storing them.

It should also be noted that, as the decoding occurs in an autoregressive way, once you have picked a different token the whole generated sequence will diverge, as this choice affects to the probability of generating every subsequent token.