Achieving deterministic API output on language models - HOWTO

Determinism: Solution first

I have found that setting temperature to 0 or top_p to 0 does not actually get you a deterministic AI. There are short-circuits for these nonsensical values (divide by zero, or include no probability mass) where they are replaced with something else that is not true greedy sampling.

What one must do is use infinitesimal values for the parameters.

A ridiculously small value approaching but not 0?

Yes.

Preposterous value: top_p=.0000000000000000000001
There is simply no way that this can be anything other than the top token.

An aside: a very, very tiny temperature has the same effect, but in the extremely rare case that the top-2 logits have identical probability (identical to the precision at which logprobs are reported), it can consistently select a different token across your trials than top_p would.
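
To see why the minuscule value works, here is a toy sketch (not the API's actual sampler; the token names and logit values are made up) of how a tiny top_p versus a tiny temperature behave on a near-tied distribution:

import math

# Toy sketch, not the API's actual sampler; tokens and logits are hypothetical.
logits = {"thousands": 10.0, "numerous": 10.0, "many": 7.0}  # exact top-2 tie

def softmax(d, temperature=1.0):
    scaled = {tok: logit / temperature for tok, logit in d.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# top_p -> 0: keep tokens until the cumulative probability reaches top_p; the
# highest-ranked token is always kept, so only it can ever be sampled.
ranked = sorted(softmax(logits).items(), key=lambda kv: -kv[1])
nucleus, cumulative = [], 0.0
for tok, p in ranked:
    nucleus.append(tok)
    cumulative += p
    if cumulative >= 1e-22:   # the "preposterous" top_p
        break
print(nucleus)                # only the single top-ranked token survives

# temperature -> 0: softmax sharpens toward the argmax, but an exact top-2 tie
# stays at ~0.5/0.5, so different runs can still pick different tokens.
print(softmax(logits, temperature=1e-9))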

Retrieve multiple chat completions:

import openai   # legacy pre-1.0 openai-python library
import os
import json

openai.api_key = os.environ["KEY"]
model = 'gpt-3.5-turbo'
respstr = []
for i in range(10):
    response = openai.ChatCompletion.create(
        messages=[{"role": "system", "content": "You are a deterministic AI assistant."},
                  {"role": "user", "content": "20 paragraph article about apples"}],
        model=model, top_p=.0000000000000000000001)
    # str() on the response object gives its JSON dump; parse it back to a dict
    out = str(response)
    respstr.append(json.loads(out)['choices'][0]['message']['content'])
    print(respstr[i][:60] + "..")

You can see I am asking it to write at length, with no max_tokens set, leaving the finish up to the AI.

bad top_p = 0

These are the kind of results I got at top_p = 0, showing just the lengths of the returns, sorted, where the output would diverge around paragraph seven:
2648,2648,2648,2648,2650,2613,2613,2601,2601,2601

good top_p = 0.00000000000001

Setting the minuscule top_p value, even while increasing the demand for more length:

[3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520]

Now let's also compare each string against every other string, although we already have high confidence:

# compare every pair of responses and report the first character where they differ
lengths = [len(s) for s in respstr]
print(lengths)
flag = False
for i in range(len(respstr)):
    for j in range(i + 1, len(respstr)):
        if respstr[i] != respstr[j]:
            flag = True
            # walk the shorter of the two strings to find the first divergence
            for pos in range(min(len(respstr[i]), len(respstr[j]))):
                if respstr[i][pos] != respstr[j][pos]:
                    print(f"\ni vs j: mismatch at char position {pos}")
                    # show 30 characters of context on either side of the divergence
                    start = max(0, pos - 30)
                    end = min(max(len(respstr[i]), len(respstr[j])), pos + 30)
                    print(f"{i}:{respstr[i][start:end]}")
                    print(f"{j}:{respstr[j][start:end]}")
                    break
if not flag:
    print(f"{model}: All outputs match")

results

gpt-3.5-turbo: All outputs match

are you sure that’s deterministic?

You can burn all the tokens you want to find out. I already paid a buck.

Our top_p = 0 run, verified again, looks like this, again with bad lengths:

[3520, 3520, 3416, 3520, 3423, 3520, 3520, 3520, 3520, 3520]

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g
and tons more of the mismatch dump follows:

i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
3: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 2100
2:me. This can improve digestion and overall gut function.

Fu
4:me. This can improve digestion, reduce the risk of gastroint

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
3: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a

Getting logprobs from chat models would let us see how close the probabilities of ("thousands" vs "numerous") are, or ("," vs "and"), and why the output diverges at those points.
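
The chat endpoint did not return logprobs for this run, but the completions endpoint does, so here is a sketch of how one might look for near-ties (same legacy pre-1.0 openai-python library and API key setup as above; the prompt, max_tokens, and the 0.01 threshold are just placeholders):

import openai

# Sketch: logprobs=5 on the completions endpoint returns the top-5 alternatives
# at each position; report positions where the top two are nearly tied.
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Write a 20 paragraph article about apples.\n\n",
    max_tokens=200,
    top_p=1e-22,
    logprobs=5,
)
for position, alternatives in enumerate(response['choices'][0]['logprobs']['top_logprobs']):
    ranked = sorted(alternatives.items(), key=lambda kv: -kv[1])
    # a near-tie in the top two logprobs is a spot where runs can diverge
    if len(ranked) > 1 and abs(ranked[0][1] - ranked[1][1]) < 0.01:
        print(position, ranked[:2])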

Now: to give us temperature or top_p on embeddings too, or some similar control of the internal state reported.


Not true in all cases. I’ve created a small script to test the difference between the temperature and the top_p parameters.

Model gpt-3.5-turbo returned different responses even with top_p = 0.00000000000001 as you suggested. This behaviour was exactly the same when using temperature=0 or top_p=0.

Curiously, I received different outputs only when asking in Portuguese. When my prompt was in English, the responses were all the same.

Here is the JavaScript that I used to test and that received different outputs:

const OpenAI = require("openai");

(async function () {
  openai = new OpenAI({
    apiKey: "xxxxxxxxxxxxxxxxxxxxxx",
  });

  for (let i = 1; i <= 50; i++) {
    openai.chat.completions
      .create({
        messages: [
          { role: "system", content: "VocĂŞ Ă© um assistente pessoal" },
          { role: "user", content: "Me diga uma coisa bem aleatória com até 100 caracteres" },
        ],
        model: "gpt-3.5-turbo",
        //   temperature: 0,
        top_p: 0.0000000000000000000001,
      })
      .then((result) => {
        console.log(result.choices[0].message.content);
      });
  }
})();

Yes, it turns out that with the 3.5-turbo models, while my small top_p setting does lock in the top token better than actually setting it to 0, there is still non-determinism in the logits, and in long outputs the "top" token at a given position can change between runs.
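
A toy illustration of that failure mode (the logit values and noise scale here are made up, just to show how a near-tie plus run-to-run numerical noise flips the greedy pick):

import random

# Hypothetical near-tied logits, as with "thousands" vs "numerous" in the apple
# article; a perturbation on the order of the gap is enough to swap the argmax.
base = {"thousands": 11.072, "numerous": 11.069}
for trial in range(5):
    noisy = {tok: logit + random.uniform(-0.005, 0.005) for tok, logit in base.items()}
    print(max(noisy, key=noisy.get))   # the greedy "top" token can differ per run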

I did a thorough investigation here, using the gpt-3.5-turbo-instruct model:
