Welcome to the community!
This is unfortunately a bit of a contentious topic, because nobody really knows.
As elmstedt mentions, this part of the documentation refers to the Whisper
class of speech-to-text models. I recall something similar being written next to the temperature parameter (not the dynamic part) in the docs, but I could be mistaken.
A prominent former user who performed extensive tests on this considered it best practice to set temperature to something like 0.0000001 instead of 0.
As mentioned, the implementation details here aren’t public at this time, so we can only assume what goes on in the background.
The temperature algorithm goes approximately like this:
p_i = e^{\frac{y_i}{T}} \cdot k
- k is just a normalizing factor (one over the sum of all the exponential terms), so that the probabilities add up to 1: \sum{p} = 1.
- y_i is some value (a logit) you get out of the model pipeline for token i. A higher value means the model tends more towards that token, a lower value means less so.
Since T, the temperature, sits in the denominator of that fraction, setting it to 0 would mean dividing by zero, which would obviously break everything.
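To make that concrete, here’s a toy Python sketch of the formula above (purely illustrative, not anything OpenAI actually runs):

```python
import math

def temperature_probs(logits, T):
    # Toy version of p_i = k * e^(y_i / T); with T == 0 this raises ZeroDivisionError.
    exps = [math.exp(y / T) for y in logits]
    k = 1 / sum(exps)               # normalizing factor so the p_i sum to 1
    return [k * e for e in exps]

logits = [2.0, 1.0, 0.1]
print(temperature_probs(logits, 1.0))   # fairly spread-out distribution
print(temperature_probs(logits, 0.2))   # sharply peaked on the highest logit
# temperature_probs(logits, 0.0)        # -> ZeroDivisionError
```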
So the question is, what does the model do?
Elmstedt discovered, I believe, that the last-gen Llama models deal with T=0 by simply skipping the temperature calculation, looking for the highest y_i and returning that particular token.
That is obviously one way to do it, but it’s also possible that they simply rewrite T=0 to some other low value, like T=0.01 or something. That would be another valid approach, and it might improve maintainability.
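Purely to illustrate the two hypothetical strategies (again, we have no idea what the actual backend does), they could look something like this:

```python
import math
import random

def softmax_sample(logits, T):
    # Standard temperature sampling: p_i = k * e^(y_i / T), then draw a token index.
    exps = [math.exp(y / T) for y in logits]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]

def pick_greedy(logits):
    # Hypothetical strategy A: skip sampling entirely and return the argmax.
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_clamped(logits, T, floor=0.01):
    # Hypothetical strategy B: silently clamp T to a small floor and sample as usual.
    return softmax_sample(logits, max(T, floor))
```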
I don’t think we can know this for certain unless something gets leaked.
But to solve your ultimate problem:
I would actually do a deeper dive here, because experience says you might be dealing with a prompt issue - and the fact that it “works” with a higher temp might just mean that you got lucky so far, and that it could break in production.
If you think you have a randomness issue:
What I do, in spite of the documentation, is set both temperature and top_p to zero.
top_p: start with the highest-probability p_i and keep gathering the most likely p_i values until \sum{p_{gathered}} > top_p. If top_p is 0, it can only gather the single most likely p_i.
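A rough sketch of that cutoff logic (illustrative only; real implementations differ in details like tie-breaking and renormalization):

```python
def top_p_filter(probs, top_p):
    # Sort token probabilities from most to least likely, then keep tokens
    # until their cumulative probability exceeds top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > top_p:      # with top_p == 0 this triggers after one token
            break
    return kept

print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))  # -> [0, 1]
print(top_p_filter([0.5, 0.3, 0.2], top_p=0.0))  # -> [0]
```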
So regardless of whether temperature 0 gets overwritten or not, I think this is one of the best ways to get rid of most potential or actual randomness. It might be overkill, but you pay for the sampling compute either way, so there’s no reason to underkill it.
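With the current Python SDK, that just looks like the following (the model name and prompt are placeholders; swap in whatever you’re actually using):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "..."},
    ],
    temperature=0,  # whatever 0 maps to internally, it's the least random setting
    top_p=0,        # belt and suspenders: restrict the nucleus to the top token
)

print(response.choices[0].message.content)
```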
What you absolutely can’t control:
OpenAI is known to perform ninja edits in the background - modifying their models within a version without notifying anyone. They’re calling these backend versions “system fingerprints” (or something like that, even that is complicated), and they can cause your results to change.
There’s also an allegedly inherent non-deterministic nature to these LLMs. This can theoretically be true, but it generally shouldn’t be an operational concern with good prompting and the right parameters.
Seed
There’s also a seed parameter that you can tweak, but it should only affect the sampler. If we effectively decouple the sampler’s effects (temp = 0, top_p = 0), seed shouldn’t have any effect. I think the seed parameter is operationally irrelevant, other than for logging purposes.
(more info on seed and fingerprints: How to make your completions outputs consistent with the new seed parameter | OpenAI Cookbook)
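If you do want to set it for logging anyway, it’s just another request parameter, and the fingerprint comes back on the response object (a sketch with the Python SDK; the values are placeholders):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "..."}],
    temperature=0,
    top_p=0,
    seed=12345,  # arbitrary value - only really matters if the sampler is active
)

# Log the backend fingerprint so you can correlate output drift with silent model updates.
print(response.system_fingerprint)
```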
TL;DR:
It’s a worthwhile endeavor to take a closer look at the sampling parameters. I think understanding them and their historical context will make you a significantly better LLM dev.
But I think your current issue is more likely to be related to your prompt, and it might be a good idea to take a look at the failure modes you’re experiencing and build countermeasures against them.