Feature Request: Deterministic Answer Option for Unit Testing

I have investigated the problem that causes the non-determinism. The root of it is in the model's output vectors and the logits produced from them: their values drift between runs in the later significant figures, similar to what we see with embeddings.

When the likelihood values for a token change from run to run by up to several percent, the top-ranked token returned by the language model can change. For example, if one token's probability is 0.23 and the runner-up's is 0.22, a few percent of drift in either value is enough to swap which one ranks first.

Here, as part of an investigation that used many different styles of API calls and scripts to process the returned data, we look at gpt-3.5-turbo-instruct, a close cousin of the chat models. I prompt the task of writing 200 poems in the style of Poe. I also set top_p=1e-16, ensuring that nothing but the single top-ranked token can fall within that probability mass. (I found I got better speed running at n>1, producing multiple outputs with one call, and got the same results as separate calls.)
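A call along these lines reproduces the setup; the exact prompt wording, max_tokens, n, and logprobs count shown here are illustrative rather than the exact values I ran:

    import openai

    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt="Write 200 original poems in the style of Edgar Allan Poe.\n\nPoem 1:",
        top_p=1e-16, temperature=1, max_tokens=2000, n=5, logprobs=3)
    # one logprobs object per generated output
    runs = [choice["logprobs"] for choice in response["choices"]]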

Report format, for each mismatch discovered when comparing multiple runs:
[token sequence leading up to the mismatch] - logprob of the last token shown
top: {the top-3 tokens and their logprob values}
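
(A minimal sketch of the kind of comparison that produces this report, assuming each run is the logprobs object of one completion choice with parallel lists tokens, token_logprobs, and top_logprobs; my actual script differs in the details:)

    def report_mismatches(runs, context=5):
        # Compare every later run against run 0 and print the first divergence.
        base = runs[0]
        for j, other in enumerate(runs[1:], start=1):
            for pos, (a, b) in enumerate(zip(base["tokens"], other["tokens"])):
                if a == b:
                    continue
                print(f"run 0 vs run {j}: mismatch at token position {pos}")
                for idx, run in ((0, base), (j, other)):
                    # last few tokens up to and including the mismatch
                    tail = run["tokens"][max(0, pos - context + 1):pos + 1]
                    print(f"{idx}:{tail} {run['token_logprobs'][pos]}")
                    print(f"{idx}:top:{run['top_logprobs'][pos]}")
                break  # only report the first mismatch for this pair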

run 0 vs run 1: mismatch at token position 686
0:['ont', 'ill', 'ado', "'s", ' Pun'] -1.4826658
0:top:{' Pun': -1.4826658, ' V': -1.4982907, ' Madness': -2.8107908}
1:['ont', 'ill', 'ado', "'s", ' V'] -1.4914709
1:top:{' V': -1.4914709, ' Pun': -1.4914709, ' Madness': -2.8195958}

run 0 vs run 2: mismatch at token position 416
0:[' "', 'The', ' Ha', 'unted', ' Castle'] -1.7199239
0:top:{' Castle': -1.7199239, ' Mind': -1.7667986, ' Forest': -2.1886737}
2:[' "', 'The', ' Ha', 'unted', ' Mind'] -1.6457075
2:top:{' Mind': -1.6457075, ' Castle': -1.7707075, ' Forest': -2.2394576}

run 0 vs run 3: mismatch at token position 435
0:['The', ' Imp', ' of', ' the', ' Night'] -1.7654115
0:top:{' Night': -1.7654115, ' Un': -1.8591615, ' Mind': -2.7497866}
3:['The', ' Imp', ' of', ' the', ' Un'] -1.8257471
3:top:{' Un': -1.8257471, ' Night': -1.8257471, ' Mind': -2.575747}

run 0 vs run 4: mismatch at token position 465
0:['The', ' Gold', ' Bug', "'s", ' En'] -1.0015507
0:top:{' En': -1.0015507, ' Quest': -1.0640508, ' R': -2.6890507}
4:['The', ' Gold', ' Bug', "'s", ' Quest'] -1.0133702
4:top:{' Quest': -1.0133702, ' En': -1.0446202, ' R': -2.73212}

We discover that in almost every case, a token that ranked first in one generation falls to second place in another, even though we are not doing random sampling but are looking directly at the probability values.

Castle’: -1.7199239 changes to -1.7707075 and becomes #2
’ Night’: -1.7654115 changes to -1.8257471 and becomes #2

So despite having a top_p constraint that should only return the best generation path, the "best" changes on us.

(Perhaps this is why OpenAI turned off logprobs in chat models.)


Now, were prior GPT-3 models deterministic?

Yes.

Let's have the 'text-curie-001' InstructGPT model do a similar task. The call:

    import openai
    model = "text-curie-001"
    response = openai.Completion.create(
        prompt="Here's 50 new original poems by AI:\n\nPoem 1 of 50:",
        model=model, top_p=1e-16, temperature=1, max_tokens=2029, n=10, logprobs=3)

What are the output lengths, and what does the mismatch report show when comparing ten runs near the maximum token count?

[2029, 2029, 2029, 2029, 2029, 2029, 2029, 2029, 2029, 2029]
text-curie-001: All outputs match
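
(The check producing those two lines is along these lines, using the legacy Completion response fields; the print formatting here is illustrative:)

    # token count per output, and whether all ten texts are identical
    token_counts = [len(c["logprobs"]["tokens"]) for c in response["choices"]]
    texts = [c["text"] for c in response["choices"]]
    print(token_counts)
    if len(set(texts)) == 1:
        print(f"{model}: All outputs match")
    else:
        print(f"{model}: Mismatch found")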

Conclusion:
Previous InstructGPT models complete deterministically all the way to the full context length with no problem.


I tried other GPT-3 base completion models, and then tried to compare them with the new base-model replacements, babbage-002 and davinci-002. The challenge I faced was that, despite their large context, they would quickly fall into repetition:
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a

With more completion-style prompting of the replacement models and a one-shot example poem, I ran multiple 8000-token runs producing up to 39 poems. The problem was that they just repeated poems very quickly, and after a number of repeats the confidence of those repeats goes very high, so I didn't bother going further into logit analysis on them.
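
(A sketch of what that completion-style, one-shot prompting looked like; the example poem text here is purely illustrative, not the one I used:)

    one_shot_prompt = (
        "Original poems in the style of Edgar Allan Poe\n\n"
        'Poem 1: "The Midnight Shore"\n'
        "Upon the midnight shore I stood alone,\n"
        "And heard the sea repeat its ancient moan.\n\n"
        "Poem 2:"
    )
    response = openai.Completion.create(
        model="davinci-002", prompt=one_shot_prompt,
        top_p=1e-16, temperature=1, max_tokens=8000, n=1, logprobs=3)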

So, in conclusion: the 3.5-generation model that gives us logprobs is non-deterministic, and from it we can reasonably infer the same about the other 3.5 models. This is a result of fundamental run-to-run changes in the certainty values the models emit, whether from architecture, hardware, design compromises, or other reasons we have no way of answering experimentally.
