Why the API output is inconsistent even after the temperature is set to 0

zhan1130 · August 25, 2023, 4:47pm

I tested this on gpt4 with the temperature set to 0. The output can be pretty inconsistent: not only do multiple runs with an identical system message and prompt fail to produce identical outputs (which in theory they should), but the semantic meaning of the output can also differ (which is even worse than non-identical text with the same meaning).

Foxalabs · August 25, 2023, 6:38pm

Hi, can you link to some log outputs that show this and also give a code snippet of your api calling code?

zhan1130 · August 25, 2023, 7:48pm

As a simple example, try having it perform arithmetic:

[
  {'role':'system', 'content':'You are a math student.'},
  {'role':'user', 'content':'what is 123432*4234'}
]

Different runs will give different results. Clearly, even at temperature = 0, the output is not deterministic.

_j · August 25, 2023, 8:11pm

First, there is no real meaning of “temperature 0”. That would be a divide by zero.
Other OpenAI APIs try to auto-tune at temperature 0, instead.

They likely take temperature 0 and give it a replacement value.

Instead, one might try a very low temperature, four zeroes after the decimal point.

{
  "index": 18,
  "message": {
    "role": "assistant",
    "content": "The product of 123432 and 4234 is 523,091,488."
  },
  "finish_reason": "stop"
},
{
  "index": 19,
  "message": {
    "role": "assistant",
    "content": "The product of 123432 and 4234 is 523,014,288."
  }

2 out of 20 gave an alternate answer.
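
As a side check on the two replies quoted above, here is a minimal sketch that tallies the distinct answers and compares them with the exact product. The `replies` list and the `extract_answer` helper are illustrative names, not part of any API, and only the two strings quoted above are real data; an actual run would collect all n choices.

```python
import re
from collections import Counter

# The two reply strings quoted above; a real run would collect every choice.
replies = [
    "The product of 123432 and 4234 is 523,091,488.",
    "The product of 123432 and 4234 is 523,014,288.",
]

def extract_answer(text: str) -> int:
    """Pull the last comma-grouped integer out of a reply."""
    last_number = re.findall(r"\d[\d,]*", text)[-1]
    return int(last_number.replace(",", ""))

exact = 123432 * 4234
tally = Counter(extract_answer(reply) for reply in replies)

for answer, count in tally.items():
    print(f"{answer:,} seen {count}x, off by {answer - exact:,}")
print(f"exact product: {exact:,}")
```

Neither quoted answer matches the exact product, so the run-to-run variation here is between two wrong answers.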

Add two more digits, for temperature=0.00000001, and I get the same answers for n=40.

The other sampling parameter, top-p, can also make alternate logit choices nearly impossible when set close to zero.

zhan1130 · August 25, 2023, 8:16pm

Nice trick! By the way, could you explain what the ‘auto-tune’ used in other APIs is? Thanks!

_j · August 25, 2023, 8:25pm

Audio transcription, for example:

temperature (number, optional, defaults to 0)

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

The probabilities derived from the logits might be deterministic, but the multinomial sampling step that then rolls the dice to pick one token according to those probabilities is not.
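
A minimal sketch of that dice roll, assuming a made-up two-token distribution. The `sample_token` helper and the probabilities are hypothetical and say nothing about how the hosted models are actually implemented; the point is only that a fixed distribution still yields varying draws.

```python
import random

def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability (a multinomial draw)."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution; the numbers are made up.
probs = {"A": 0.6, "B": 0.4}

rng = random.Random(0)  # seeded only so this sketch is repeatable
draws = [sample_token(probs, rng) for _ in range(1000)]
print({token: draws.count(token) for token in probs})
```

The distribution never changes between draws, yet both tokens appear; only the seed makes the sequence reproducible.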

zhan1130 · August 25, 2023, 9:11pm

So I believe they did the same thing for the gpt-4 API. Also, I think the crucial question is what the threshold is. Judging by the behavior, the threshold for the gpt-4 API is certainly higher than 0.00000001.

anon22939549 · August 26, 2023, 3:46am

First, this is just wrong. There very much is a “real meaning” of temperature T = 0, division by zero notwithstanding.

The meaning of T = 0 is greedy sampling. The limiting behavior of, say, softmax as T approaches 0 is to return a one-hot encoded vector where the element with the highest sampling probability is mapped to 1 and all other elements are mapped to 0.

So to say,

there is no real meaning of “temperature 0”

is just flatly wrong.

Gonna need a big ol’ citation for that, sport. It is far easier and more appropriate to simply perform a greedy sample with T = 0.
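
The limiting behavior described above can be sketched as follows. This is an illustration under stated assumptions, not OpenAI's implementation: the function name and the logits are made up, and treating T = 0 as an explicit greedy branch is one common way to avoid the division by zero.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax of logits / temperature, with T = 0 treated as greedy (one-hot)."""
    if temperature == 0:
        # Greedy special case: no division, just pick the largest logit.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up values
for t in (1.0, 0.5, 0.1, 0.01, 0.0):
    print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])
```

As the temperature shrinks, the printed distribution concentrates on the first element, and the T = 0 branch returns exactly the one-hot vector that the small-T values approach.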

I just fired off 50 runs in the playground and 50 out of 50 were the same.

¯\_(ツ)_/¯

Next I requested n = 50 responses through the API and they were all the same.

Now, there has been some discussion regarding the model not being perfectly deterministic at T = 0, but none of that anywhere has ever suggested it’s because OpenAI is not actually using a temperature of 0.

The two theories I’ve seen that hold the most weight (for me) are:

1. GPT-4 is a sparse mixture-of-experts model, so when they batch tokens for evaluations, your input tokens can find themselves in a race condition with others. The end result becomes that the model is deterministic at the batch level, not the sequence level. This is mentioned in the paper, From Sparse to Soft Mixtures of Experts.
2. Some parts of the GPU parallelism employed may be non-deterministic. For instance, the order in which values are summed can propagate floating point inaccuracies. It’s possible these inaccuracies are the root cause of the non-determinism.

I consider both of these to be infinitely more likely than “OpenAI isn’t doing greedy-sampling for T = 0.”
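
The floating-point theory can be illustrated with a toy example. Everything here is hypothetical (the `greedy_pick` helper, the token names, and the rival logit of 0.6 are made up); it only shows that summing the same values in a different order can change which token a greedy pick returns when two logits are nearly tied.

```python
# Same three numbers, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

def greedy_pick(logits: dict[str, float]) -> str:
    """Return the token with the largest logit (the first one wins a tie)."""
    return max(logits, key=logits.get)

rival = 0.6  # hypothetical logit of a competing token
pick_a = greedy_pick({"rival": rival, "token": left_to_right})
pick_b = greedy_pick({"rival": rival, "token": right_to_left})
print(pick_a, pick_b)
```

With one summation order the token narrowly beats its rival; with the other the two tie and the rival is returned, even though the pick itself is fully greedy.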

_j · August 26, 2023, 3:57am

Read GPT-2 source code.
I have.

The code self-documents how logits are selected.
A temperature below 1 increases the distance between the normalized probabilities, because the logits are divided by a fraction (that is, multiplied by the reciprocal of the temperature).

GPT-3 is just an incremental advance that followed shortly after, apart from having over 100x the parameters. GPT-4’s temperature handling is obscured by the mixture-of-experts architecture it is supposed to use.

_j · August 26, 2023, 6:53am

It took a good amount of futzing around, and the prompt is only happenstance from other stuff I was trying, but I have an interesting result.

If you really want to have fun with statistics, do trials on two top logit token outputs that match to 8 digits of accuracy!

"top_logprobs": [
 {
  " Aug": -2.4173014,
  " Oct": -2.4173014,
  " Mar": -2.440739,
  " Jan": -2.440739
 }
]

Aug = 8.92%
Oct = 8.92%
Jan = 8.71%
Mar = 8.71%
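
The percentages above follow from exponentiating the log probabilities. A minimal sketch of that conversion, using the four values from the `top_logprobs` output (the variable names are just for illustration):

```python
import math

# Log probabilities copied from the top_logprobs output above.
top_logprobs = {
    " Aug": -2.4173014,
    " Oct": -2.4173014,
    " Mar": -2.440739,
    " Jan": -2.440739,
}

# exp(logprob) is the probability; multiply by 100 for a percentage.
percent = {token: 100 * math.exp(lp) for token, lp in top_logprobs.items()}

for token, pct in percent.items():
    print(f"{token.strip()} = {pct:.2f}%")
```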

model: davinci-002
max_tokens: 1

"prompt": """In the square brackets are 1000 random ASCII characters, using 0-9a-zA-Z: [0-9a-zA-Z]{1000}.

share|improve this answer

edited"""

Let’s run 70 trials at multiple settings. Extract the first letter each time.

“top_p”: 0.0891, temperature 2
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

“top_p”: 0.0892, temperature 2
OOOAAOAAAOOOAAAAOOAOAAOOAOOAOAOOAOAAOAAOOOOAAAOAAAOAAOOOAAAAOOOOAAOOAO

Thus there is an exact top_p threshold at which the second token is allowed.
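
The threshold is consistent with the textbook definition of nucleus (top-p) sampling, sketched below. This is an assumption about the general technique, not a description of OpenAI's actual code; the `nucleus` helper is hypothetical, and which of two tied tokens a real sampler ranks first cannot be determined from this sketch.

```python
import math

def nucleus(ranked: list[tuple[str, float]], top_p: float) -> list[str]:
    """Keep the highest-probability tokens until their cumulative mass reaches top_p."""
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept

# Log probabilities copied from the top_logprobs output above.
logprobs = {" Aug": -2.4173014, " Oct": -2.4173014, " Mar": -2.440739, " Jan": -2.440739}

# Sort by probability, highest first; ties keep their input order here.
ranked = sorted(
    ((token, math.exp(lp)) for token, lp in logprobs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(nucleus(ranked, 0.0891))  # top token alone already covers 0.0891
print(nucleus(ranked, 0.0892))  # top token (about 0.08916) falls short, so a second is added
```

Because the top token's probability is about 0.08916, a top_p of 0.0891 is satisfied by one token while 0.0892 needs two, which matches the switch from a single letter to a mix.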

Let’s continue:

“top_p”: 0.0892, temperature=0.000000001 (very A)
OAAAAAAAAAAAAOAAAAOAAAAAAAAAAAAAAAOAAOAAOAOAAAAAAAAAOOAAAAOOOOAAOAOAAA

“top_p”: 0.0892, temperature=0.0000000001 (all A)
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

And, believe it or not, switching from a minuscule temperature to exactly 0 changes the result:

First letter results of “top_p”: 0.0892, temperature=0.0
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

And if we lift the top_p restriction, it changes again:

First letter results of “top_p”: 1.0, temperature=0.0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The odd thing is that the temperature-limit and top_p-limit methods converge on different tokens of the two allowed, depending on the setting.

Are the two literally tied as far as top_p is concerned, so that the first one seen is picked, while temperature is able to put distance between the probabilities?
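
One way to picture the hypothesis in that question: if the two logits differ by an amount smaller than the 8 printed digits, dividing by a tiny temperature magnifies the gap. The sketch below is purely illustrative; `hidden_gap` is a made-up number, `p_first` is a hypothetical helper, and nothing here is confirmed about how the API behaves.

```python
import math

def p_first(gap: float, temperature: float) -> float:
    """Probability of the slightly larger of two logits after dividing by temperature."""
    return 1.0 / (1.0 + math.exp(-gap / temperature))

hidden_gap = 1e-9  # made-up difference, below the 8 digits shown in top_logprobs

for t in (2.0, 1e-9, 1e-10):
    print(t, round(p_first(hidden_gap, t), 5))
```

Under this assumption the pair is a coin flip at temperature 2, leans toward one token at 1e-9, and is almost always that token at 1e-10, the same qualitative pattern as the trials above.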
