Setting temperature to zero does not work

Hello,

I am trying to get deterministic answers to my prompt by setting the temperature to zero. It works in the Playground, but when I try to do it through the API I get different answers to the same prompt.

Here is a simplified version of the code I use:

import openai

openai.api_key = 'mykey'

response = openai.ChatCompletion.create(
  model="gpt-4-turbo-preview",
  messages=[
    {
      "role": "system",
      "content": "Our goal is to assign each billionaire a \"self-made score\". The score ranges from 1 to 10, with 1 to 5 meaning an individual inherited most of his or her wealth and 6 to 10 meaning he or she built their company or established their fortune.\n\nHere is a breakdown of self-made scores, along with representative examples for each score.\n\n- 1/10 : Inherited fortune but not working to increase it.\nExamples : Dagmar Dolby ; Alice Walton\n- 2/10 : Inherited fortune and has a role in managing it.\nExamples : Jim Walton ; Laurene Powell Jobs\n- 3/10 : Inherited fortune and helping to increase it marginally.\nExamples : Carl Cook ; Jimmy Haslam\n- 4/10 : Inherited fortune and increasing it in a meaningful way.\nExamples : Abigail Johnson ; Bubba and Dan Cathy\n- 5/10 : Inherited small or medium-size business and made it into a ten-digit fortune.\nExamples : Rupert Murdoch ; Micky Arison\n- 6/10 : Hired hand who didn’t create the business:\nExamples : Meg Whitman ; Steve Ballmer\n- 7/10 : Self-made who got a head start from wealthy parents and moneyed background.\nExamples : William Ackman ; Reed Hastings\n- 8/10 : Self-made who came from a middle-class or upper-middle-class background.\nExamples : Mark Zuckerberg ; Jeff Bezos\n- 9/10 : Self-made who came from a largely working-class background; rose from little to nothing.\nExamples : Sergey Brin ; Judy Love\n- 10/10 : Self-made who not only grew up poor but also overcame significant obstacles.\nExamples : George Soros ; Harold Hamm\n\nI need you to help me score other billionaires based on this logic. Your response should start with a short phrase (15 words) about the origin of the fortune of the billionaire and MUST end with the self-made score following this format: \"Self-made score: (score)/10\". Do not add anything else.\n\nWhich score would you give to: ""
    },
    {
      "role": "user",
      "content": "Larry Ellison"
    }
  ],
  temperature=0,
  max_tokens=64,
  top_p=1
)

answer = response.choices[0].message['content'].strip()
print(f"RĂ©ponse: {answer}\n")

For instance, I got these two different answers with the same prompt:

Larry Ellison co-founded Oracle Corporation, a software company, starting with a small investment. Self-made score: 9/10

Larry Ellison co-founded Oracle Corporation, a software company, starting with a small investment. Self-made score: 8/10

Does anyone have an idea?

Thanks a lot for your help!

Hi there - that's completely normal behaviour. Models are non-deterministic: even with the temperature set to 0, you are unlikely to get the exact same answer every time.

In your case, since you are looking for a score, a common strategy for dealing with this is to run the same query multiple times and then aggregate the results, for example by taking the average of the scores or by discarding the highest and lowest before averaging.
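
For instance, here is a minimal sketch of that approach, assuming the same legacy openai library as in your snippet, with SYSTEM_PROMPT standing in for your rubric prompt:

import re
import statistics

import openai

openai.api_key = 'mykey'

SYSTEM_PROMPT = "..."  # the scoring rubric from the original post

def ask_score(name, n=5):
    # Run the same scoring prompt n times and average the parsed scores.
    scores = []
    for _ in range(n):
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": name},
            ],
            temperature=0,
            max_tokens=64,
        )
        text = response.choices[0].message['content']
        match = re.search(r"Self-made score: (\d+)/10", text)
        if match:
            scores.append(int(match.group(1)))
    return statistics.mean(scores) if scores else None

print(ask_score("Larry Ellison"))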

The best parameters you can hope for:

  temperature=1e-19,
  top_p=1e-9,
  seed=1234,
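
As a sketch, a full request with these parameters and logprobs enabled might look like this, assuming the newer openai>=1.0 Python client (the messages here are placeholders):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "Reply with only a single score, 1-10."},
        {"role": "user", "content": "Larry Ellison"},
    ],
    temperature=1e-19,
    top_p=1e-9,
    seed=1234,
    max_tokens=1,
    logprobs=True,    # return logprobs for each generated token
    top_logprobs=5,   # include the 5 most likely alternatives per position
)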

Just asking the AI for a single-digit answer and requesting the logprobs, you can see here how certain it actually is:

  "logprobs": {
    "content": [
      {
        "token": "8",
        "logprob": -0.52091527,
        "bytes": [
          56
        ],
        "top_logprobs": [
          {
            "token": "8",
            "logprob": -0.52091527,
            "bytes": [
              56
            ]
          },
          {
            "token": "9",
            "logprob": -0.98327184,
            "bytes": [
              57
            ]
          },
          {
            "token": "7",
            "logprob": -3.6280663,
            "bytes": [
              55
            ]
          },
          {
            "token": "10",
            "logprob": -5.4347277,
            "bytes": [
              49,
              48
            ]
          },
          {
            "token": "6",
            "logprob": -6.9738894,
            "bytes": [
              54
            ]
          }
        ]
      }
    ]
  }

If you wanted to be really tricky, you could take such a 1-token answer, extract the integers returned in the top-5 logprobs, and weight each one by its logprob converted to a probability, giving a new floating-point answer.
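
A sketch of that calculation, assuming a response object like the one from the request above (newer openai>=1.0 client):

import math

# Top-5 alternatives for the first (and only) generated token,
# from the response requested in the sketch above.
top = response.choices[0].logprobs.content[0].top_logprobs

# Keep only integer tokens; convert each logprob to a probability.
weighted = [(int(t.token), math.exp(t.logprob))
            for t in top if t.token.strip().isdigit()]

total = sum(p for _, p in weighted)
score = sum(value * p for value, p in weighted) / total

# With the logprobs shown above, this comes out to about 8.35
# instead of a hard 8 or 9.
print(f"{score:.2f}")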

I think if you want a deterministic answer, you need a more deterministic prompt. Your scores are defined very qualitatively, using terminology whose meaning is highly subjective, or at least reasonable people could disagree as to its meaning. Terms like: most, helping, marginally, meaningful, managing, hired hand, head start, wealthy, middle-class, upper-middle-class, working-class, poor, largely, little to nothing, significant obstacles. LLMs behave like humans, so they won't be deterministic about non-deterministic concepts.

If you like your prompt, I agree with @jr.2509 and @_j about doing some post-completion manipulation. Alternatively, you can use more deterministic prompts, built on criteria such as:

- the ratio of the inherited fortune to the current value of the business;
- the titles someone obtained in the business (like CEO or CFO);
- the level of education of the billionaire's parents, grandparents, and the billionaire herself or himself;
- the billionaire's income from investments vs. salary;
- whether the business is a public company or a private company;
- the number of employees reporting to the billionaire;
- whether the billionaire had tutors growing up or went to private school;
- the average property value of homes in the neighborhood where the billionaire was born;
- whether the billionaire experienced childhood tragedy, like the death of a parent.

Essentially, use criteria that are established objectively with a number or a yes/no answer (a hypothetical illustration follows below). Then the scores are likely to be more predictable.

Giving examples of other billionaires is a good tactic, but reasonable people can disagree as to whether billionaire X is more like billionaire A or B. Imagine that each time you run the query, you are asking a different person to follow the prompt. A temperature of 0 means each person will eschew creativity and follow your instructions more rigidly, but it's still a different person answering a highly subjective question, and they will inevitably interpret the qualitative criteria a little differently than any other person. That's less likely to happen, though, if your criteria are numerical or in the yes/no style.
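
For illustration, two rubric levels rewritten in that objective style might read as follows (the thresholds are hypothetical, not Forbes's actual methodology):

- 4/10: Inherited more than 50% of the current fortune (yes/no) AND the business at least doubled in value under his or her management (yes/no).
- 7/10: Inherited less than 10% of the current fortune (yes/no) AND at least one parent held an executive title or a professional degree (yes/no).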

Thanks a lot for your answers! My surprise came from the fact that I do seem to get deterministic answers in the Playground and not through the API. But maybe it was just luck then! Seems like I will have to run the same query multiple times, too bad for my savings :slight_smile:

Thanks, this is really interesting.

Could you explain your choice for the seed parameter? I am not sure what it is supposed to achieve in this case.

Yes, I agree, but I wanted to use the “self-made score” as defined by Forbes for US billionaires and apply it to other countries, so I kind of have to stick with their definition. Seems like the best way is to run it multiple times for each billionaire then :slight_smile:

I showed the seed just because it is also a parameter that is supposed to make models more deterministic.

Reusing the same multinomial sampling seed, a parameter OpenAI has only recently provided, makes the token-selection step use the same randomness values on subsequent runs. It is not expected to affect other parts of the model that can also behave in a random-like way (such as differing GPU precision, tolerated computational errors in hardware, or the expert routing of a switch transformer in a mixture-of-experts model), but the full architecture hasn't been published since GPT-2, so we can't be sure.

The seed doesn't do as much for you when the probabilities coming out of the model's inference softmax shift slightly between runs due to that non-determinism: the decision boundaries between tokens move, so the same cumulative-probability cutoff used to pick a token can land on a different choice.

TL;DR: the sampler can pick different tokens despite the best API parameters to prevent it, because what comes out of the model's computation currently changes between identical API calls to OpenAI.
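
You can observe this yourself with a sketch like the following (assuming the newer openai>=1.0 client). The system_fingerprint field identifies the backend configuration a request ran on; OpenAI only suggests seed-based reproducibility when it matches between calls:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run():
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": "Name one color."}],
        temperature=1e-19,
        top_p=1e-9,
        seed=1234,
        max_tokens=10,
    )
    return response.choices[0].message.content, response.system_fingerprint

# Identical parameters and seed; the outputs can still differ, and if the
# fingerprints differ, the requests ran on different backend configurations.
print(run())
print(run())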