Achieving deterministic API output on language models - HOWTO

Determinism: Solution first

I have found that setting temperature to 0 or top_p to 0 does not actually get you a deterministic AI. There are short-circuits for these nonsensical values (divide by zero, or include no probability mass) where they are replaced with something else that is not true greedy sampling.

What one must do is use infinitesimal values for the parameters.

A ridiculously small value approaching but not 0?

Yes.

Preposterous value: top_p=.0000000000000000000001
There is simply no way that this can be anything other than the top token.

An aside: a very, very tiny temperature has the same effect, but in the extremely rare case that the top-2 logits have identical probability (identical to the precision at which logprobs are reported), it can consistently select a different token across your trials than top_p would.
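
To see why the minuscule value works, here is a toy sketch (not the API's actual sampler; the token names and logit values are made up) of how a tiny top_p versus a tiny temperature behave on a near-tied distribution:

import math

# Toy sketch, not the API's actual sampler; tokens and logits are hypothetical.
logits = {"thousands": 10.0, "numerous": 10.0, "many": 7.0}  # exact top-2 tie

def softmax(d, temperature=1.0):
    scaled = {tok: logit / temperature for tok, logit in d.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# top_p -> 0: keep tokens until the cumulative probability reaches top_p; the
# highest-ranked token is always kept, so only it can ever be sampled.
ranked = sorted(softmax(logits).items(), key=lambda kv: -kv[1])
nucleus, cumulative = [], 0.0
for tok, p in ranked:
    nucleus.append(tok)
    cumulative += p
    if cumulative >= 1e-22:   # the "preposterous" top_p
        break
print(nucleus)                # only the single top-ranked token survives

# temperature -> 0: softmax sharpens toward the argmax, but an exact top-2 tie
# stays at ~0.5/0.5, so different runs can still pick different tokens.
print(softmax(logits, temperature=1e-9))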

Retrieve multiple chat completions:

import openai   # legacy pre-1.0 openai-python library
import os
import json

openai.api_key = os.environ["KEY"]
model = 'gpt-3.5-turbo'
respstr = []
for i in range(10):
    response = openai.ChatCompletion.create(
        messages=[{"role": "system", "content": "You are a deterministic AI assistant."},
                  {"role": "user", "content": "20 paragraph article about apples"}],
        model=model, top_p=.0000000000000000000001)
    # str() on the response object gives its JSON dump; parse it back to a dict
    out = str(response)
    respstr.append(json.loads(out)['choices'][0]['message']['content'])
    print(respstr[i][:60] + "..")

You can see I am asking it to write at length, with no max_tokens set, leaving the finish up to the AI.

bad top_p = 0

These are the kind of results I got at top_p = 0, showing just the lengths of the returns, sorted, where the output would diverge around paragraph seven:
2648,2648,2648,2648,2650,2613,2613,2601,2601,2601

good top_p = 0.00000000000001

Setting the minuscule top_p value, even while increasing the demand for more length:

[3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520]

Now let's also compare each string against every other string, although we already have high confidence:

# compare every pair of responses and report the first character where they differ
lengths = [len(s) for s in respstr]
print(lengths)
flag = False
for i in range(len(respstr)):
    for j in range(i + 1, len(respstr)):
        if respstr[i] != respstr[j]:
            flag = True
            # walk the shorter of the two strings to find the first divergence
            for pos in range(min(len(respstr[i]), len(respstr[j]))):
                if respstr[i][pos] != respstr[j][pos]:
                    print(f"\ni vs j: mismatch at char position {pos}")
                    # show 30 characters of context on either side of the divergence
                    start = max(0, pos - 30)
                    end = min(max(len(respstr[i]), len(respstr[j])), pos + 30)
                    print(f"{i}:{respstr[i][start:end]}")
                    print(f"{j}:{respstr[j][start:end]}")
                    break
if not flag:
    print(f"{model}: All outputs match")

results

gpt-3.5-turbo: All outputs match

are you sure that’s deterministic?

You can burn all the tokens you want to find out. I already paid a buck.

Our top_p = 0 run, verified again, looks like this, again with bad lengths:

[3520, 3520, 3416, 3520, 3423, 3520, 3520, 3520, 3520, 3520]

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g
and tons more of the mismatch dump follows:

i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
3: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 2100
2:me. This can improve digestion and overall gut function.

Fu
4:me. This can improve digestion, reduce the risk of gastroint

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
3: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a

Getting logprobs from chat models would let us see how close the probabilities of ("thousands" vs "numerous") are, or ("," vs "and"), and why the output diverges at those points.
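
The chat endpoint did not return logprobs for this run, but the completions endpoint does, so here is a sketch of how one might look for near-ties (same legacy pre-1.0 openai-python library and API key setup as above; the prompt, max_tokens, and the 0.01 threshold are just placeholders):

import openai

# Sketch: logprobs=5 on the completions endpoint returns the top-5 alternatives
# at each position; report positions where the top two are nearly tied.
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Write a 20 paragraph article about apples.\n\n",
    max_tokens=200,
    top_p=1e-22,
    logprobs=5,
)
for position, alternatives in enumerate(response['choices'][0]['logprobs']['top_logprobs']):
    ranked = sorted(alternatives.items(), key=lambda kv: -kv[1])
    # a near-tie in the top two logprobs is a spot where runs can diverge
    if len(ranked) > 1 and abs(ranked[0][1] - ranked[1][1]) < 0.01:
        print(position, ranked[:2])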

Now: to give us temperature or top_p on embeddings too, or some similar control of the internal state reported.


Not true in all cases. I’ve created a small script to test the difference between the temperature and the top_p parameters.

Model gpt-3.5-turbo returned different responses even with top_p = 0.00000000000001 as you suggested. This behaviour was exactly the same when using temperature=0 or top_p=0.

Curiously, I received different outputs only when asking in Portuguese. When my prompt was in English, the responses were all the same.

Here is the JavaScript that I used to test and that received different outputs:

const OpenAI = require("openai");

(async function () {
  openai = new OpenAI({
    apiKey: "xxxxxxxxxxxxxxxxxxxxxx",
  });

  for (let i = 1; i <= 50; i++) {
    openai.chat.completions
      .create({
        messages: [
          { role: "system", content: "VocĂŞ Ă© um assistente pessoal" },
          { role: "user", content: "Me diga uma coisa bem aleatória com até 100 caracteres" },
        ],
        model: "gpt-3.5-turbo",
        //   temperature: 0,
        top_p: 0.0000000000000000000001,
      })
      .then((result) => {
        console.log(result.choices[0].message.content);
      });
  }
})();

Yes, it turns out that with the 3.5-turbo models, while my small top_p setting does lock in the top token better than actually setting it to 0, there is still non-determinism in the logits, and in long outputs the "top" token at a given position can change between runs.
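
A toy illustration of that failure mode (the logit values and noise scale here are made up, just to show how a near-tie plus run-to-run numerical noise flips the greedy pick):

import random

# Hypothetical near-tied logits, as with "thousands" vs "numerous" in the apple
# article; a perturbation on the order of the gap is enough to swap the argmax.
base = {"thousands": 11.072, "numerous": 11.069}
for trial in range(5):
    noisy = {tok: logit + random.uniform(-0.005, 0.005) for tok, logit in base.items()}
    print(max(noisy, key=noisy.get))   # the greedy "top" token can differ per run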

I did a thorough investigation here, using the gpt-3.5-turbo-instruct model:
