Why is GPT-4 giving different answers with same prompt & temperature=0?

This is my code for calling the gpt-4 model:

import openai  # the ChatCompletion interface requires openai<1.0

messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": req},
]

response = openai.ChatCompletion.create(
    engine="******-gpt-4-32k",  # deployment name (redacted)
    messages=messages,
    temperature=0,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

answer = response["choices"][0]["message"]["content"]

Keeping system_msg and req constant, with temperature=0, I get different answers. For instance, the last time I ran this 10 times I got 3 different answers. The answers are similar in concept, but worded differently.

I was expecting the exact same answer every time. Why is this happening?

You can try reducing top_p as well as temperature. That will narrow down the word choices even further.
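
For example, here is a minimal tweak to the call from your post (reusing messages from above, with the same placeholder engine name from your snippet):

response = openai.ChatCompletion.create(
    engine="******-gpt-4-32k",
    messages=messages,
    temperature=0,
    top_p=0.1,  # keep only the highest-probability tokens as candidates
)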

But why is the top choice changing every time the GPT-4 model is called?

You may have two words with the same score in the top_p list of word/token choices.

In really simplistic terms, temperature tells it what percentage of the top words it can pick the next word from, based on the sum of their probabilities. (This is not 100% correct, but it helps explain the next part.)

A temperature of zero tells it to pick the top word from the top_p list. But that list of words can be huge, and some words may sit at the top with the same probability score, especially short, common words.

What top_p does is reduce the list of word choices. A value of 1 means the list contains every possible word, ranked from highest probability to lowest.

A top_p of 0 basically tells the AI to throw away all the words except the very top ones, so temperature has less to pick from. Even a value of 0.1 will make a big difference.

By reducing the number of words on the list, the temperature has less to pick from.
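
If it helps, here is a toy Python sketch of that idea (not OpenAI's actual decoder; the tokens and probability scores are invented):

# Toy illustration only: made-up tokens and scores (a real distribution
# covers the whole vocabulary and sums to 1).
probs = {"the": 0.18, "a": 0.18, "this": 0.12, "one": 0.08, "zebra": 0.0001}

def top_p_filter(probs, top_p):
    # Keep the smallest set of highest-ranked tokens whose cumulative
    # probability reaches top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return kept

print(top_p_filter(probs, top_p=1.0))  # keeps the whole (toy) list
print(top_p_filter(probs, top_p=0.1))  # keeps only the single top token

# Note that "the" and "a" are tied at 0.18, so which token counts as
# "the top one" is ambiguous: a decoder that breaks the tie differently
# on different runs picks a different word even at temperature=0.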

So try a temperature of 0 and a top_p of 0.1 and see if it makes a difference.

No harm in trying…

Note: I used “words” in the description above to make it easier to explain, but it is actually “tokens” that are being ranked; two different words could start with the same token.
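
If you want to look at the token level yourself, OpenAI's tiktoken tokenizer (a separate package, assuming you have it installed) shows how words map to token IDs:

import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for word in [" there", " therefore", " the"]:
    # A word may map to one token or several; different words can
    # share their leading token.
    print(repr(word), "->", enc.encode(word))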


Thanks for your detailed reply! 😄

I tried again with top_p = 0.1
Don’t see any difference though.

What you said made sense to me for temp > 0: it will have fewer options with lower values of top_p. But since at temp=0 it simply picks the highest-probability option, I don't understand how changing top_p would make a big difference in this case.

I’m still seeing 3 different answers if I run it 10 times.

Sorry it didn't work. I was hoping it would make it even more rigid in its response.

The GPT generation process is non-deterministic by default. You can see a more thorough discussion of this in this thread (and its associated link). TL;DR: the problem is that the token with the “highest probability” is ill-defined due to the finite number of digits used when multiplying and storing the probabilities. Hope it helps 🙂
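
To illustrate that finite-precision point in plain Python (nothing specific to the API), adding the same floats in a different order usually produces a slightly different total, so a comparison between two near-tied probabilities can flip:

import random

random.seed(0)
vals = [random.random() for _ in range(100_000)]

# Float addition is not associative: summing the same numbers in a
# different order usually gives a slightly different total.
forward = sum(vals)
backward = sum(reversed(vals))
print(forward == backward)  # typically False
print(forward - backward)   # a tiny but nonzero difference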
