# Achieving deterministic API output on language models - HOWTO

## Determinism: Solution first

I have found that setting temperature to 0 or top_p to 0 does not actually get you a deterministic AI. These nonsensical values (divide by zero, or include zero probability mass) appear to be short-circuited server-side and replaced with something else that is not true greedy sampling.
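I can only guess at what the short-circuit looks like on the backend, but a purely hypothetical guard clause would behave something like this (the fallback value is an assumption, not anything documented):

```python
def sanitize_top_p(top_p):
    """Purely hypothetical server-side guard: top_p = 0 would keep zero
    probability mass, so a backend may silently replace it rather than
    treat it as a request for greedy sampling."""
    if top_p <= 0:
        return 1.0  # assumed fallback: nucleus filtering effectively off
    return top_p

print(sanitize_top_p(0))      # the "deterministic" request samples freely
print(sanitize_top_p(1e-22))  # the tiny value passes through untouched
```

If something like this guard exists, it explains why 0 behaves worse than a tiny positive value: the tiny value never trips the sanity check.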

What one must do is use infinitesimal values for the parameters.

## A ridiculously small value approaching but not 0?

Yes.

Preposterous value: top_p=.0000000000000000000001
There is simply no way that this can be anything other than the top token.
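To see why, here is a minimal sketch of nucleus (top_p) sampling: the sampler keeps adding tokens, highest probability first, until the cumulative probability reaches top_p, and it always keeps at least one. With a threshold like 1e-22, the very first (top) token already exceeds it (the probabilities below are made up):

```python
def top_p_candidates(probs, top_p):
    """Return the indices kept by nucleus sampling: the smallest set of
    highest-probability tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)  # at least one token always survives
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

probs = [0.02, 0.50, 0.48]             # made-up distribution
print(top_p_candidates(probs, 1e-22))  # [1] - only the top token survives
print(top_p_candidates(probs, 0.9))    # [1, 2] - a normal nucleus
```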

An aside: a very, very tiny temperature has the same effect, but in the extremely rare case of identical-probability top-2 logits (identical to the precision at which logprobs are reported), it can permanently select a different token in your trials than top_p would.
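The temperature version of the same trick, sketched: dividing the logits by a temperature near zero before the softmax pushes essentially all probability mass onto the argmax (the logit values here are made up):

```python
import math

def softmax_at_temperature(logits, T):
    """Softmax over logits scaled by 1/T; as T -> 0 the distribution
    collapses onto the highest logit (a hard tie stays a tie)."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_at_temperature([2.0, 1.9, 0.1], 0.001)
print(probs[0])  # ~1.0: the top token takes nearly all the mass
```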

## Retrieving multiple chat completions

```python
import openai
import os

openai.api_key = os.environ["KEY"]
model = 'gpt-3.5-turbo'
respstr = []
for i in range(10):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a deterministic AI assistant."},
            # user message asking for a long essay (elided in the original post)
        ],
        top_p=.0000000000000000000001,
    )
    out = response["choices"][0]["message"]["content"]
    respstr.append(out)
    print(respstr[i][:60] + "..")
```

You can see I am asking it to write at length, with no max_tokens, leaving the finish up to the AI.

These are the kinds of results I got at top_p = 0, looking simply at the lengths of the returns, sorted, where the output would diverge around paragraph seven:

[2648, 2648, 2648, 2648, 2650, 2613, 2613, 2601, 2601, 2601]

## Good: top_p = 0.00000000000001

Setting the minuscule top_p value, even while increasing the demand for more length:

[3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520]

Now let's also compare every string to every other string, although we already have high confidence:

```python
lengths = [len(s) for s in respstr]
print(lengths)
flag = False
for i in range(len(respstr)):
    for j in range(i + 1, len(respstr)):
        if respstr[i] != respstr[j]:
            flag = True
            # walk forward to the first differing character
            for pos in range(min(len(respstr[i]), len(respstr[j]))):
                if respstr[i][pos] != respstr[j][pos]:
                    print(f"\ni vs j: mismatch at char position {pos}")
                    start = max(0, pos - 30)
                    end = min(max(len(respstr[i]), len(respstr[j])), pos + 30)
                    print(f"{i}:{respstr[i][start:end]}")
                    print(f"{j}:{respstr[j][start:end]}")
                    break
if not flag:
    print(f"{model}: All outputs match")
```

## results

`gpt-3.5-turbo: All outputs match`

## are you sure that's deterministic?

You can burn all the tokens you want to find out. I already paid a buck.

Our top_p = 0 run, verified again, looks like this; again, inconsistent lengths:

```
[3520, 3520, 3416, 3520, 3423, 3520, 3520, 3520, 3520, 3520]

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g
```
…and tons more of the mismatch dump:

```
i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
3: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 2100
2:me. This can improve digestion and overall gut function.

Fu
4:me. This can improve digestion, reduce the risk of gastroint

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
3: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a
```

Getting logprobs from chat models would let us see how close the probabilities of ("thousands" vs "numerous") are, or ("," vs "and"), and why the output diverges at exactly that point.
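As an illustration of how close those candidates could be (the logprob values below are hypothetical, not retrieved from the API): a gap this small corresponds to a probability ratio barely above 1, well within run-to-run numeric noise.

```python
import math

# Hypothetical near-tied logprobs for the two competing tokens
lp_thousands = -0.69310
lp_numerous = -0.69317

# exp(difference of logprobs) gives the ratio of the two probabilities
ratio = math.exp(lp_thousands - lp_numerous)
print(ratio)  # ~1.00007: far less separation than batch-to-batch float noise
```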

Next wish: temperature or top_p on embeddings, or some similar control of the internal state reported.


Not true in all cases. I've created a small script to test the difference between the temperature and the top_p parameters.

Model gpt-3.5-turbo returned different responses even with top_p = 0.00000000000001 as you suggested. This behaviour was exactly the same when using temperature=0 or top_p=0.

Curiously, I received different outputs only when I asked in Portuguese. When my prompt was in English, the responses were all the same.

Here is the JavaScript that I used to test and that received different outputs:

```javascript
const OpenAI = require("openai");

(async function () {
  const openai = new OpenAI({
    apiKey: "xxxxxxxxxxxxxxxxxxxxxx",
  });

  for (let i = 1; i <= 50; i++) {
    openai.chat.completions
      .create({
        messages: [
          // "You are a personal assistant"
          { role: "system", content: "Você é um assistente pessoal" },
          // "Tell me something really random, up to 100 characters"
          { role: "user", content: "Me diga uma coisa bem aleatória com até 100 caracteres" },
        ],
        model: "gpt-3.5-turbo",
        //   temperature: 0,
        top_p: 0.0000000000000000000001,
      })
      .then((result) => {
        console.log(result.choices[0].message.content);
      });
  }
})();
```

Yes, it turns out that with the 3.5-turbo models, while my small top_p setting does indeed lock in the top token better than actually setting it to 0, there is still non-determinism in the logits themselves, and the position of the "top" token can change over the course of long outputs.
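A sketch of that failure mode with made-up numbers: when the top-2 logits are nearly tied, a perturbation on the order of floating-point rounding error (from different batch sizes or kernel paths) is enough to swap the argmax, so even pure top-token picking diverges:

```python
logits_run_a = [3.0000010, 3.0000000, 1.0]          # hypothetical near-tie
logits_run_b = [3.0000010 - 2e-6, 3.0000000, 1.0]   # same request, tiny numeric jitter

pick_a = max(range(len(logits_run_a)), key=lambda i: logits_run_a[i])
pick_b = max(range(len(logits_run_b)), key=lambda i: logits_run_b[i])
print(pick_a, pick_b)  # 0 1 - the "top" token changed with no change in the prompt
```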

I did a thorough investigation of this, using the gpt-3.5-turbo-instruct model.
