Non-deterministic probabilities *for first generated token* in chat.completion?

I understand that sampling is not deterministic and I’m not going to get the same generated text each time I run the same code, even when setting the temperature to really low values.
However, I was under the impression that the probabilities of the first generated token, which depend only on the prompt (and the prompt is fixed), should be stable across different runs if temperature, seed, etc. are kept constant. The sampled token may differ, but the probabilities shouldn't.
Yet this is not the behavior I'm experiencing: each run produces wildly different probabilities for the first token, even with a fixed seed and temperature, and even after accounting for system_fingerprint. Is this the expected behavior?

Attaching a screenshot displaying this behavior.
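For reference, this is roughly the check I'm running: a minimal sketch assuming the current openai Python client, with a placeholder model name and prompt.

```python
import math
from openai import OpenAI

client = OpenAI()

for run in range(3):
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder; any chat model that returns logprobs
        messages=[{"role": "user", "content": "Answer Yes or No: is the sky blue?"}],
        temperature=0,
        seed=42,
        max_tokens=1,          # only the first generated token matters here
        logprobs=True,
        top_logprobs=5,
    )
    first_token = resp.choices[0].logprobs.content[0]
    print(f"run {run}  system_fingerprint={resp.system_fingerprint}")
    for alt in first_token.top_logprobs:
        # convert each logprob to a probability for easier comparison across runs
        print(f"  {alt.token!r}: {math.exp(alt.logprob):.2%}")
```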

This is normal and should be expected, even though, in principle, it shouldn't be.

There are other sources of randomness that are very difficult to eliminate, particularly those arising from the extreme parallelism of the transformer architecture running on GPUs.

The suspicion many have is that the variability in the order in which threads finish execution inside the GPU, and therefore the order in which floating-point operations are accumulated, is the root cause of this non-deterministic result.

While it is possible to account for this, it potentially entails a lot of waiting for threads to finish, which, as you might imagine, carries a huge efficiency cost.

At present, you should never expect the results to be perfectly deterministic.
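To make that concrete, here is a tiny illustration in plain Python (nothing OpenAI-specific): floating-point addition is not associative, so summing the same numbers in a different order produces slightly different totals. At the scale of a transformer's matrix multiplies, those tiny differences can nudge the logits, and therefore the logprobs, between runs.

```python
import random

random.seed(0)
# values spanning many orders of magnitude, like the terms of a large dot product
values = [random.uniform(-1.0, 1.0) * 10 ** random.randint(-8, 8) for _ in range(100_000)]

shuffled = values[:]
random.shuffle(shuffled)

total_forward = sum(values)            # one accumulation order
total_reversed = sum(reversed(values)) # the same numbers, reversed
total_shuffled = sum(shuffled)         # the same numbers, shuffled

print(total_forward, total_reversed, total_shuffled)
print(total_forward == total_reversed == total_shuffled)  # typically False
```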


Thanks for your answer!
I understand where you're coming from. It is, however, somewhat unsettling that, for exactly the same prompt, the probability of a "No" answer can range from 51% to 97%. That's a really big gap…

For sure, I totally get that!

The way I would interpret those kinds of results, though, is that there is room to improve your context management.

Either through optimizing the system message, rewriting the user message, or pulling in additional context and performing RAG (see the sketch at the end of this post). The idea is that if there truly should be only one output for a particular input and you're seeing that much variability, the issue is that you haven't given the model the scaffolding it needs to do the job you're asking of it.

Something I repeat often around here is to “meet the models where they are.” So, do that: give the model the support it needs to succeed.
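As a purely hypothetical sketch of what that scaffolding can look like (the model name, system message, and helper function are all made up for illustration): a strict system message, the user's question, and any retrieved context pasted in explicitly, so the model has everything it needs to commit to one answer.

```python
from openai import OpenAI

client = OpenAI()

def ask_yes_no(question: str, retrieved_context: str) -> str:
    """Ask a binary question with the supporting context supplied up front."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model
        temperature=0,
        max_tokens=1,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a classifier. Answer with exactly one word, Yes or No, "
                    "based only on the context provided by the user."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}",
            },
        ],
    )
    return resp.choices[0].message.content

# e.g. ask_yes_no("Is the invoice overdue?", "<retrieved invoice record here>")
```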


That’s the “efficiency” of gpt-4-turbo and the new gpt-3.5-turbo-0125 for you. We don’t know why what should be the same calculations yield different output from the AI.

It is a symptom that was not seen on prior GPT-3 models, where, across hundreds of trials investigating sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see the two switch position or return different values.

You can use completions and get more reliable answers out of gpt-3.5-turbo-instruct than what you're seeing from the latest chat models.

Completions prompt with few-shot:

Here is an extremely clever and logical artificial intelligence. We set the AI so it only responds with “true” or “false” in lower case as a string, formatted in JSON with the response as the value of a single key “response”. Then we asked some conditional reasoning statements and questions, and the AI got all 100% correct:

Input: You are a human.
AI: {"response": "false"}
Input: You are intelligent.
AI: {"response": "true"}
Input: Do you like puppies?
AI: {"response": "true"}
Input: Do you like flowers?
AI: {"response": "
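If you want to reproduce the numbers below, this is roughly the call, assuming the current openai Python client and the legacy completions endpoint (asking for 5 top logprobs is just my choice here):

```python
import math
from openai import OpenAI

client = OpenAI()

# Paste the exact few-shot prompt shown above, ending with: AI: {"response": "
few_shot_prompt = "...the few-shot prompt above..."

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=few_shot_prompt,
    max_tokens=1,    # we only care about the next token: "true" or "false"
    temperature=0,
    logprobs=5,      # return the top-5 logprobs for that token
)

# top_logprobs[0] is a {token: logprob} dict for the first generated token
for token, logprob in resp.choices[0].logprobs.top_logprobs[0].items():
    print(f"{token!r}: {math.exp(logprob):.2%}")
```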

56.60% to 43.33% “true” (after I’ve made the two token possibilities extremely clear), and the same logprob to that precision again and again.

Bizarrely, if the puppies example is changed so the AI doesn’t like puppies, flowers goes to 78%. Remove the extra blank line separating the pretext and the example chat, and “false” wins by 1%. Even when the output is the same between runs, it is a random word machine.