Clarifications on setting temperature = 0

Hi,

I was trying to get my model to generate some JSON output (which I then parse later) and have noticed that it seems to be fine when the temperature is around 0.3, but not when it’s 0 (though I have not had the chance to do a deep dive). This has led me to question what actually happens when we set the temperature to 0. The documentation says:

If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

I have 3 questions:

  1. Does this mean the temperature is dynamic and varies from token to token when it is set to 0.0, or does the model just pick a temperature and stick with it throughout generation?
  2. Does this mean it’s possible that if I manually set the temperature to e.g. 0.01, I might get a more deterministic output than what the model chooses when it’s set to 0?
  3. Is it possible to elaborate on what these thresholds are?
2 Likes

That is the description of the temperature parameter for Whisper, the speech-to-text model.

temperature = 0

just causes the model to use greedy sampling when selecting the next token.
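In other words, a minimal sketch of what greedy selection means (the logits here are made-up numbers, not anything the API actually exposes):

```python
import numpy as np

# Hypothetical next-token logits -- purely illustrative values.
logits = np.array([2.1, 0.3, -1.0, 1.7])

# Greedy sampling: no temperature math at all, just take the highest-scoring token.
next_token_id = int(np.argmax(logits))
print(next_token_id)  # 0
```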

https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature

temperature
number or null

Optional
Defaults to 1
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

We generally recommend altering this or top_p but not both.

1 Like

While this is indeed the case with open-source models like those from Meta, OpenAI’s models still generate different results each time, even when the temperature is set to 0. Therefore, the OP also has a point.

The comment that the referenced documentation section describes a different model (Whisper) is correct.

However, the OP was seeking clarification on what actually happens when the temperature is set to 0.

So if possible, could you please provide an official explanation or evidence that OpenAI’s models do perform greedy sampling when the temperature is set to 0, like Meta’s models?

Such clarification with evidence or an official explanation would be the true answer to the OP’s question.

2 Likes

There are also other sources of randomness, like the state of the RNG and race conditions in multithreaded code.

The evidence that

T = 0 \equiv \text{the greedy algorithm}

is the fact that as the temperature T approaches 0 in a stochastic process, the probability distribution becomes increasingly deterministic. At T = 0, the process always selects the option with the highest probability, effectively turning into a greedy algorithm.

This occurs because:

  1. The temperature parameter T controls the randomness in sampling.
  2. As T decreases, the probability differences between options are amplified.
  3. At T = 0, this amplification becomes infinite, making the highest probability option overwhelmingly dominant.
  4. Consequently, the process always chooses the most probable option, which is the definition of greedy sampling.

This behavior is consistent across various stochastic algorithms, including softmax sampling and Boltzmann exploration.
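You can watch that collapse happen numerically. A minimal sketch, using the textbook softmax-with-temperature formulation and made-up logits (not OpenAI’s actual implementation):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Standard softmax sampling distribution at temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()               # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.5, 0.5]       # made-up next-token scores
for T in (1.0, 0.5, 0.1):
    print(T, softmax_with_temperature(logits, T).round(3))

# 1.0 -> roughly [0.55, 0.33, 0.12]
# 0.5 -> roughly [0.71, 0.26, 0.04]
# 0.1 -> roughly [0.99, 0.01, 0.00]  -- already indistinguishable from argmax
```

In the limit T → 0 the top token gets all of the probability mass, which is exactly greedy selection.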

There’s no reason to do anything different here, but if they were going to, it would presumably be stated explicitly, as it is in the Whisper documentation.

3 Likes

I think that was an excellent explanation in response to the OP’s question.
The aforementioned points should be correct, so it’s unlikely that the OP was asking about Whisper.
This clarification should have addressed the OP’s question well!

1 Like

Welcome to the community!

This is unfortunately a bit of a contentious topic, because nobody really knows.

As elmstedt mentions, this part of the documentation refers to the Whisper class of speech-to-text models. I recall something similar being written next to the temperature parameter (not the dynamic part) in the docs, but I could be mistaken.

A prominent former user who performed extensive tests on this considered it best practice to set temperature to something like 0.0000001 instead of 0.

As mentioned, the implementation details here aren’t public at this time, so we can only assume what goes on in the background.

The temperature algorithm goes approximately like this:

p_i = k \cdot e^{\frac{y_i}{T}}

  • k is just a normalizing factor (the reciprocal of the sum of all the exponential terms), so that the sum of all probabilities, \sum_i{p_i}, ends up being 1.
  • y_i is some value you get out of the model pipeline for token i. A higher value means the model tends more towards that token, a lower value less so.

Since T, the temperature, sits in the denominator of that fraction, setting it to 0 would obviously break everything (division by zero).

So the question is, what does the model do?

Elmstedt discovered, I believe, that the last-gen Llama models deal with T = 0 by simply skipping the temperature calculation, looking for the highest y_i and returning that particular token.

That is obviously one way to do it, but it’s also possible that they simply rewrite T=0 to some other low value, like T=0.01 or something. That would be another way to do it, and might improve maintainability.
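Neither approach is documented for OpenAI’s models, so the following is only a sketch of the two strategies, assuming a plain softmax sampler and made-up logits:

```python
import numpy as np

def sample_next_token(logits, T, rng):
    logits = np.asarray(logits, dtype=float)

    # Strategy 1 (what the Llama-style implementations reportedly do):
    # skip the temperature calculation entirely and return the argmax.
    if T == 0:
        return int(np.argmax(logits))

    # Strategy 2 would instead remap T = 0 to some tiny value up front,
    # e.g. T = max(T, 0.01), and then fall through to the code below.

    z = logits / T
    z -= z.max()              # numerical stability
    p = np.exp(z)
    p /= p.sum()              # the normalization ("k") from the formula above
    return int(rng.choice(len(logits), p=p))

rng = np.random.default_rng(0)
print(sample_next_token([2.0, 1.5, 0.5], T=0, rng=rng))    # always 0
print(sample_next_token([2.0, 1.5, 0.5], T=1.0, rng=rng))  # can vary
```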

I don’t think we can know this for certain unless something gets leaked.


But to solve your ultimate problem:

I would actually try to do a deeper dive here, because experience says you might be dealing with a prompt issue - and the fact that it “works” with a higher temp just means that you got lucky so far, and that it might break in production.

If you think you have a randomness issue:

What I do, in spite of the documentation, is set both temperature and top_p to zero.

top_p: start with the highest-probability p_i and keep gathering the most likely p values until \sum{p_{gathered}} > top_p. If top_p is 0, it can only gather the single most likely p_i.
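A sketch of that gathering step, using the textbook nucleus-sampling idea rather than anything OpenAI has published:

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability exceeds top_p."""
    order = np.argsort(probs)[::-1]   # most likely tokens first
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(int(i))
        cumulative += probs[i]
        if cumulative > top_p:        # with top_p = 0 this triggers after one token
            break
    return kept

print(top_p_filter(np.array([0.6, 0.3, 0.1]), top_p=0))    # [0]
print(top_p_filter(np.array([0.6, 0.3, 0.1]), top_p=0.8))  # [0, 1]
```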

So regardless of whether temperature 0 gets overwritten or not, I think this is one of the best ways to get rid of most potential or actual randomness. It might be overkill, but you pay for sampling compute either way, so there’s no reason to underkill it.
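In practice that just means passing both parameters in the request, e.g. with the openai Python SDK (the model name and prompts are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model you're on
    messages=[
        {"role": "system", "content": "Reply with valid JSON only."},
        {"role": "user", "content": "Give me a sample product record."},
    ],
    temperature=0,
    top_p=0,
)
print(response.choices[0].message.content)
```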

What you absolutely can’t control:

OpenAI is known to perform ninja edits in the background - modifying their models within a version without notifying anyone. They’re calling these versions “fingerprints” (or something, even that is complicated), and they can cause your results to change.

There’s also some alleged inherent non-deterministic nature to these LLMs. This can theoretically be true but generally shouldn’t be an operational concern with good prompting and choosing the right params.

Seed

There’s also a seed parameter that you can tweak, but it should only affect the sampler. If we effectively neutralize the sampler (temp = 0, top_p = 0), the seed shouldn’t have any effect. I think the seed parameter is operationally irrelevant, other than for logging purposes.
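If you do want to log it, the sketch below (same placeholder model) shows where the seed goes in and where the fingerprint comes back:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0,
    top_p=0,
    seed=42,              # only matters if the sampler is actually sampling
)

# system_fingerprint identifies the backend configuration that served the request;
# if it changes between calls, results may differ even with a fixed seed.
print(response.system_fingerprint)
```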

(more info on seed and fingerprints: How to make your completions outputs consistent with the new seed parameter | OpenAI Cookbook)


TL;DR:

It’s a worthwhile endeavor to take a closer look at the sampling parameters. I think understanding them and their historical context will make you a significantly better LLM dev.

But I think your current issue is more likely to be related to your prompt, and it might be a good idea to take a look at the failure modes you’re experiencing and build countermeasures against them :slight_smile:

4 Likes

Great write-up

Since we don’t know what they do, and because they don’t really specify, I would think Occam’s razor is suitable here. It is easier (and more performant, even if only by a small amount) to just skip the temperature function entirely rather than rewrite the variable and then run the function with an extremely low number.

Agreed. I mean, why not :person_shrugging:

1 Like

I just googled razors, and I found this applicable gem :laughing:

  • Alder’s razor (also known as Newton’s flaming laser sword): If something cannot be settled by experiment or observation, then it is not worthy of debate.

Philosophical razor - Wikipedia

2 Likes