ChatCompletions are not deterministic even with seed set, temperature=0, top_p=0, n=1

I tried to get deterministic results from the OpenAI API via the Python package. All of the following attempts failed:

  • setting a seed
  • setting temperature to 0
  • setting top_p to 0
  • setting temperature / top_p to 0.000000000000001
  • setting n to 1
  • …and all kinds of combinations of the above

I tried it with gpt-4-0125-preview as well as gpt-4-1106-preview, both of which should support the seed argument (and don’t raise an error when it is passed).
Here is the script to reproduce the issue:

from openai import OpenAI

# from openai.types.chat.completion_create_params import ResponseFormat
from dotenv import load_dotenv
from pydantic import BaseModel, Field
import os
import json

load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


class PythonInput(BaseModel):
    query: str = Field(
        description="python script or command WITHOUT COMMENTS which will be evaluated by the eval command.\nNever start variable names with numbers!"
    )


parameters = PythonInput.schema()
parameters = {k: v for k, v in parameters.items() if k in ["type", "properties", "required", "definitions"]}
fn_schema_str = json.dumps(parameters)

tool_desc = (
    "> Tool Name: python-tool\n"
    "Tool Description: A Python shell. Use this to execute python commands.\nNever start variable names with numbers!\n"
    f"Tool Args: {fn_schema_str}\n"
)


long_system_message = """You are working with a python shell that has a pandas DataFrame.
The name of the dataframe is `df`.
You have pandas and numpy available as pd and np. This is the description of df:

This DataFrame contains balance sheet data of the company in question for several years.
Each row contains the balance sheet data of a year.
The column 'Year' is very important. It is sorted DESC and contains the year for which the report is valid.
The column 'Currency' is also important, it specifies the currency in which the numbers are reported in.

Some rules to follow:

1. Group calculations per Year and include the Year column in your answer.
2. If you are asked how a metric developed, respond with the absolute values of the last couple of years, not the percentage changes.
3. Return information from the context in the form of a markdown table.
4. Include the Currency column if possible.
5. Don't use dropna on df, you could lose important information.
6. If calculating diffs: use .diff(-1) to calculate differences to previous year. And don't include the Currency column for the diff, only in the assign part. Good example:
df[['Year', 'Total_Assets', 'Cash_Bank_Deposits', 'AS30', 'Goodwill', 'AS32', 'AS40']].diff(-1).assign(Year=df['Year'], Currency=df['Currency'])

This is the description of the relevant columns:
Column 'Trade_Receivables': Trade Receivables 
Column 'Intragroup_Receivables': Intragroup Receivables
Column 'Other_Receivables': Other Receivables
Column 'Subtotal_Inventory': Subtotal Inventory
Column 'Total_Assets': Total Assets 
Column 'Currency': Currency in which all figures for this balance sheet are reported

This is the result of `print(df.head())`:
   Other_Receivables Currency  Year  Total_Assets  Intragroup_Receivables  Subtotal_Inventory  Trade_Receivables
0       3.025e+09      EUR  2022  2.3510e+10                       0        8.789101e+08       7.813689e+08
1       1.801e+09      EUR  2019  2.2478e+10                       0        7.458830e+08       8.145821e+08

## Tools
You have access to tools. You are responsible for using
the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools
to complete each subtask.

You have access to the following tools:
{tool_desc}

If you want to call a tool, respond with the following json-format:
{{"tool_name": <tool_name, one of [python-tool]>, "tool_args": <json of tool_args, corresponding to the schema>}}

Use the tool to answer the questions posed to you.""".format(
    tool_desc=tool_desc
)
responses = []

for i in range(10):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": long_system_message,
            },
            {
                "role": "user",
                "content": "For all of the following lines, calculate its share of total assets: Subtotal Liquid Assets, Trade Receivables, Intragroup Receivables, Other Receivables, Subtotal Inventory, Subtotal Tangible Fixed Assets, Subtotal Intangible Fixed Assets, Subtotal Financial Fixed Assets",
            },
        ],
        model="gpt-4-0125-preview",
        seed=42,
        # top_p=0,
        # temperature=0,
        top_p=0.000000000000001,
        temperature=0.000000000000001,
        n=1,
        response_format={"type": "json_object"},
    )
    responses.append(chat_completion.choices[0].message.content)
print("The number of unique different responses is: ", len(set(responses)))
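One diagnostic worth adding to the loop (a sketch, not part of the original script): each ChatCompletion response carries a system_fingerprint field, and identical seeds are only expected to reproduce results while that fingerprint stays the same. Collecting (system_fingerprint, content) pairs and grouping them, e.g. with a small helper like the hypothetical group_by_fingerprint below, shows whether the variation coincides with backend changes:

```python
from collections import defaultdict

def group_by_fingerprint(results):
    """Group response texts by the system_fingerprint they were produced under.

    `results` is a list of (system_fingerprint, content) tuples, e.g. collected
    inside the loop above as
    (chat_completion.system_fingerprint, chat_completion.choices[0].message.content).
    """
    groups = defaultdict(set)
    for fingerprint, content in results:
        groups[fingerprint].add(content)
    return dict(groups)
```

If every fingerprint maps to exactly one response, the backend configuration changed between calls; if a single fingerprint maps to several distinct responses, the sampling itself is non-deterministic.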

That is correct.

It’s a known issue that chat completions are not perfectly deterministic.


IMHO, it’s not just “not perfectly” deterministic: I get 6–9 unique answers when querying 10 times, with very different lengths as well (155–444 completion tokens). That is more randomness than I would expect even without a seed :wink:

Related to 685769

Very similar issue, but it’s not only message.content that is non-deterministic; the message.tool_calls.function.arguments attribute is non-deterministic as well.

Tested with gpt-4-0125-preview and gpt-4-1106-preview.

Here is the code to reproduce the issue:

from openai import OpenAI

# from openai.types.chat.completion_create_params import ResponseFormat
from dotenv import load_dotenv
from pydantic import BaseModel, Field
import os
import json

load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
messages = [
    {
        "role": "system",
        "content": "You are working with a python shell that has a pandas DataFrame.\nThe name of the dataframe is `df`.\nYou have pandas and numpy available as pd and np. This is the description of df:\n\nThis DataFrame contains balance sheet data of the company in question for several years.\nEach row contains the balance sheet data of a year.\nThe column 'Year' is very important. It is sorted DESC and contains the year for which the report is valid.\nThe column 'Currency' is also important, it specifies the currency in which the numbers are reported in.\n\nSome rules to follow:\n\n1. Group calculations per Year and include the Year column in your answer.\n2. If you are asked how a metric developed, respond with the absolute values of the last couple of years, not the percentage changes.\n3. Return information from the context in the form of a markdown table.\n4. Include the Currency column if possible.\n5. Don't use dropna on df, you could lose important information.\n6. If calculating diffs: use .diff(-1) to calculate differences to previous year. And don't include the Currency column for the diff, only in the assign part. Good example:\ndf[['Year', 'Total_Assets', 'Cash_Bank_Deposits', 'AS30', 'Goodwill', 'AS32', 'AS40']].diff(-1).assign(Year=df['Year'], Currency=df['Currency'])\n\nThis is the description of the relevant columns:\nColumn 'Trade_Receivables': Trade Receivables \nColumn 'Intragroup_Receivables': Intragroup Receivables\nColumn 'Other_Receivables': Other Receivables\nColumn 'Subtotal_Inventory': Subtotal Inventory\nColumn 'Total_Assets': Total Assets \nColumn 'Currency': Currency in which all figures for this balance sheet are reported\n\nThis is the result of `print(df.head())`:\n   Other_Receivables Currency  Year  Total_Assets  Intragroup_Receivables  Subtotal_Inventory  Trade_Receivables\n0       3.025e+09      EUR  2022  2.3510e+10                       0        8.789101e+08       7.813689e+08\n1       1.801e+09      EUR  2019  2.2478e+10                       0        7.458830e+08       8.145821e+08\n\nUse the tool to answer the questions posed to you.",
    },
    {
        "role": "user",
        "content": "For all of the following lines, calculate its share of total assets: Subtotal Liquid Assets, Trade Receivables, Intragroup Receivables, Other Receivables, Subtotal Inventory, Subtotal Tangible Fixed Assets, Subtotal Intangible Fixed Assets, Subtotal Financial Fixed Assets",
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "python-tool",
            "description": "A Python shell. Use this to execute python commands.\nNever start variable names with numbers!\n",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "title": "Query",
                        "description": "python script or command WITHOUT COMMENTS which will be evaluated by the eval command.\nNever start variable names with numbers!",
                        "type": "string",
                    }
                },
                "required": ["query"],
            },
        },
    }
]

function_arguments = []
for i in range(10):
    chat_completion = client.chat.completions.create(
        messages=messages,
        model="gpt-4-0125-preview",
        tools=tools,
        tool_choice="auto",
        seed=42,
        # top_p=0,
        # temperature=0,
        top_p=0.000000000000001,
        temperature=0.000000000000001,
        n=1,
    )
    function_arguments.append(chat_completion.choices[0].message.tool_calls[0].function.arguments)
print("The number of unique different function_arguments is: ", len(set(function_arguments)))

The output of the script is:

The number of unique different function_arguments is: 9

Which means that only two of the ten function calls produced identical arguments; the other eight were all distinct.
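One caveat when counting unique argument strings (a sketch, assuming the raw strings are valid JSON): two JSON strings can differ only in whitespace or key order while parsing to the same object, so it can be worth canonicalizing before comparing. The helper name canonical below is hypothetical:

```python
import json

def canonical(args_str: str) -> str:
    """Re-serialize a JSON string with sorted keys and fixed separators,
    so purely cosmetic differences (whitespace, key order) collapse."""
    return json.dumps(json.loads(args_str), sort_keys=True, separators=(",", ":"))

# Two argument strings that differ as text but encode the same object:
a = '{"query": "df.head()"}'
b = '{ "query" : "df.head()" }'
assert canonical(a) == canonical(b)
```

Using len(set(map(canonical, function_arguments))) instead of len(set(function_arguments)) rules out purely cosmetic variation before concluding the tool calls differ semantically.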


Have you tried this more recently or with an older model (say, gpt-3.5-turbo)?
Wondering if you found any solution so far.

Then, how does it compare to other methods in the openai package?

What is the point of having a Temperature of 0 if it doesn’t produce the same result every time? (all others being equal)

Because it’s as deterministic as possible given the architecture of their infrastructure.

I suspect the non-determinism is due to the subtleties of race conditions in a massively parallel, distributed system that is optimized for throughput.
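To illustrate how precision alone can do this (a toy sketch, not OpenAI's actual pipeline): floating-point addition is not associative, so a parallel reduction that sums the same values in a different order can round differently, and at temperature 0 a near-tie between two candidate tokens means one such rounding difference flips the argmax:

```python
# Floating-point addition is not associative: the same values summed in a
# different order can round to different results.
xs = [0.1, 0.2, 0.3]
forward = (xs[0] + xs[1]) + xs[2]   # 0.6000000000000001
backward = (xs[2] + xs[1]) + xs[0]  # 0.6
assert forward != backward

# Toy logits for two candidate tokens that are almost exactly tied.
# Greedy decoding (temperature 0) takes the argmax, so the order-dependent
# rounding above decides which token wins, and the divergence then
# compounds over the rest of the generation.
logits_run_a = {"token_x": forward, "token_y": 0.6000000000000001}
logits_run_b = {"token_x": backward, "token_y": 0.6000000000000001}
winner_a = max(logits_run_a, key=logits_run_a.get)  # "token_x" (ties keep the first key)
winner_b = max(logits_run_b, key=logits_run_b.get)  # "token_y"
assert winner_a != winner_b
```

The token names and logit values here are of course made up; the point is only that a one-ulp difference at a near-tie is enough to change a greedy decoding step.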


@joel.gotsch I have actually experimented with this myself a while ago and found all kinds of interesting things, like this non-determinism with sampling “removed” and seed set.

Initially I waved off the infra/precision effects because the variance was too high, but as @anon22939549 mentioned, the infra is so massive that at this scale, once multiplied, small precision errors can lead to massive variability.


It is unfortunate and makes the temperature setting less useful.