# Achieving deterministic API output on language models - HOWTO

## Determinism: Solution first

I have found that setting temperature to 0 or top_p to 0 does not actually get you a deterministic AI. These nonsensical values (divide by zero, or include zero probability mass) appear to be short-circuited server-side and replaced with something else that is not true greedy sampling.
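I can only guess at what the short-circuit looks like on the backend, but a purely hypothetical guard clause would behave something like this (the fallback value is an assumption, not anything documented):

```python
def sanitize_top_p(top_p):
    """Purely hypothetical server-side guard: top_p = 0 would keep zero
    probability mass, so a backend may silently replace it rather than
    treat it as a request for greedy sampling."""
    if top_p <= 0:
        return 1.0  # assumed fallback: nucleus filtering effectively off
    return top_p

print(sanitize_top_p(0))      # the "deterministic" request samples freely
print(sanitize_top_p(1e-22))  # the tiny value passes through untouched
```

If something like this guard exists, it explains why 0 behaves worse than a tiny positive value: the tiny value never trips the sanity check.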

What one must do is use infinitesimal values for the parameters.

## A ridiculously small value approaching but not 0?

Yes.

Preposterous value: top_p=.0000000000000000000001
There is simply no way that this can be anything other than the top token.
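To see why, here is a minimal sketch of nucleus (top_p) sampling: the sampler keeps adding tokens, highest probability first, until the cumulative probability reaches top_p, and it always keeps at least one. With a threshold like 1e-22, the very first (top) token already exceeds it (the probabilities below are made up):

```python
def top_p_candidates(probs, top_p):
    """Return the indices kept by nucleus sampling: the smallest set of
    highest-probability tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)  # at least one token always survives
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

probs = [0.02, 0.50, 0.48]             # made-up distribution
print(top_p_candidates(probs, 1e-22))  # [1] - only the top token survives
print(top_p_candidates(probs, 0.9))    # [1, 2] - a normal nucleus
```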

An aside: a very, very tiny temperature has the same effect, but in the extremely rare case of identical-probability top-2 logits (identical to the precision at which logprobs are reported), it can permanently select a different token in your trials than top_p would.
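The temperature version of the same trick, sketched: dividing the logits by a temperature near zero before the softmax pushes essentially all probability mass onto the argmax (the logit values here are made up):

```python
import math

def softmax_at_temperature(logits, T):
    """Softmax over logits scaled by 1/T; as T -> 0 the distribution
    collapses onto the highest logit (a hard tie stays a tie)."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_at_temperature([2.0, 1.9, 0.1], 0.001)
print(probs[0])  # ~1.0: the top token takes nearly all the mass
```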

## Retrieving multiple chat completions

```python
import openai
import os

openai.api_key = os.environ["KEY"]
model = 'gpt-3.5-turbo'
respstr = []
for i in range(10):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a deterministic AI assistant."},
            # user message asking for a long essay (elided in the original post)
        ],
        top_p=.0000000000000000000001,
    )
    out = response["choices"][0]["message"]["content"]
    respstr.append(out)
    print(respstr[i][:60] + "..")
```

You can see I am asking it to write at length, with no max_tokens, leaving the finish up to the AI.

These are the kinds of results I got at top_p = 0, looking simply at the lengths of the returns, sorted, where the output would diverge around paragraph seven:

[2648, 2648, 2648, 2648, 2650, 2613, 2613, 2601, 2601, 2601]

## Good: top_p = 0.00000000000001

Setting the minuscule top_p value, even while increasing the demand for more length:

[3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520, 3520]

Now let's also compare every string to every other string, although we already have high confidence:

```python
lengths = [len(s) for s in respstr]
print(lengths)
flag = False
for i in range(len(respstr)):
    for j in range(i + 1, len(respstr)):
        if respstr[i] != respstr[j]:
            flag = True
            # walk forward to the first differing character
            for pos in range(min(len(respstr[i]), len(respstr[j]))):
                if respstr[i][pos] != respstr[j][pos]:
                    print(f"\ni vs j: mismatch at char position {pos}")
                    start = max(0, pos - 30)
                    end = min(max(len(respstr[i]), len(respstr[j])), pos + 30)
                    print(f"{i}:{respstr[i][start:end]}")
                    print(f"{j}:{respstr[j][start:end]}")
                    break
if not flag:
    print(f"{model}: All outputs match")
```

## results

`gpt-3.5-turbo: All outputs match`

## are you sure that's deterministic?

You can burn all the tokens you want to find out. I already paid a buck.

Our top_p = 0 run, verified again, looks like this; again, inconsistent lengths:

```
[3520, 3520, 3416, 3520, 3423, 3520, 3520, 3520, 3520, 3520]

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
0: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g
```
…and tons more of the mismatch dump:

```
i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
2: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
1: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
3: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 2100
2:me. This can improve digestion and overall gut function.

Fu
4:me. This can improve digestion, reduce the risk of gastroint

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
2: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
3: chronic diseases.

There are numerous varieties of apples a
4: chronic diseases.

There are thousands of apple varieties g

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
5: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
6: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
7: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
8: chronic diseases.

There are numerous varieties of apples a

i vs j: mismatch at char position 1022
4: chronic diseases.

There are thousands of apple varieties g
9: chronic diseases.

There are numerous varieties of apples a
```

Getting logprobs from chat models would let us see how close the probabilities of ("thousands" vs "numerous") are, or ("," vs "and"), and why the output diverges at exactly that point.
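As an illustration of how close those candidates could be (the logprob values below are hypothetical, not retrieved from the API): a gap this small corresponds to a probability ratio barely above 1, well within run-to-run numeric noise.

```python
import math

# Hypothetical near-tied logprobs for the two competing tokens
lp_thousands = -0.69310
lp_numerous = -0.69317

# exp(difference of logprobs) gives the ratio of the two probabilities
ratio = math.exp(lp_thousands - lp_numerous)
print(ratio)  # ~1.00007: far less separation than batch-to-batch float noise
```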

Next wish: temperature or top_p on embeddings, or some similar control of the internal state reported.


Not true in all cases. I've created a small script to test the difference between the temperature and the top_p parameters.

Model gpt-3.5-turbo returned different responses even with top_p = 0.00000000000001 as you suggested. This behaviour was exactly the same when using temperature=0 or top_p=0.

Curiously, I received different outputs only when I asked in Portuguese. When my prompt was in English, the responses were all the same.

Here is the JavaScript that I used to test and that received different outputs:

```javascript
const OpenAI = require("openai");

(async function () {
  const openai = new OpenAI({
    apiKey: "xxxxxxxxxxxxxxxxxxxxxx",
  });

  for (let i = 1; i <= 50; i++) {
    openai.chat.completions
      .create({
        messages: [
          // "You are a personal assistant"
          { role: "system", content: "Você é um assistente pessoal" },
          // "Tell me something really random, up to 100 characters"
          { role: "user", content: "Me diga uma coisa bem aleatória com até 100 caracteres" },
        ],
        model: "gpt-3.5-turbo",
        //   temperature: 0,
        top_p: 0.0000000000000000000001,
      })
      .then((result) => {
        console.log(result.choices[0].message.content);
      });
  }
})();
```

Yes, it turns out that with the 3.5-turbo models, while my small top_p setting does indeed lock in the top token better than actually setting it to 0, there is still non-determinism in the logits themselves, and the position of the "top" token can change over the course of long outputs.
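A sketch of that failure mode with made-up numbers: when the top-2 logits are nearly tied, a perturbation on the order of floating-point rounding error (from different batch sizes or kernel paths) is enough to swap the argmax, so even pure top-token picking diverges:

```python
logits_run_a = [3.0000010, 3.0000000, 1.0]          # hypothetical near-tie
logits_run_b = [3.0000010 - 2e-6, 3.0000000, 1.0]   # same request, tiny numeric jitter

pick_a = max(range(len(logits_run_a)), key=lambda i: logits_run_a[i])
pick_b = max(range(len(logits_run_b)), key=lambda i: logits_run_b[i])
print(pick_a, pick_b)  # 0 1 - the "top" token changed with no change in the prompt
```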

I did a thorough investigation of this, using the gpt-3.5-turbo-instruct model.
