Seed param and reproducible output do not work

Hi,
I wrote some code following the docs to check whether I can get the same output by setting the "seed" param, but the output still differs between requests. Both "gpt-4-1106-preview" and "gpt-3.5-turbo" give unreproducible results even when all inputs and the seed are identical.
Am I misunderstanding how the seed param is used?

from openai import OpenAI
import difflib

# GPT_MODEL = "gpt-4-1106-preview"
GPT_MODEL = "gpt-3.5-turbo"
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def get_chat_response(system_message: str, user_request: str, seed: int = None):
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_request},
    ]

    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=messages,
        seed=seed,
    )

    # print(response)

    response_content = response.choices[0].message.content
    system_fingerprint = response.system_fingerprint
    prompt_tokens = response.usage.prompt_tokens
    completion_tokens = (
        response.usage.total_tokens - response.usage.prompt_tokens
    )

    return response_content

def compare_responses(previous_response: str, response: str):
    diff = difflib.Differ().compare(previous_response.splitlines(), response.splitlines())
    print('\n'.join(diff), end="")

def main():
    topic = "a happy journey to Mars"
    system_message = "You are a helpful assistant that generates short stories."
    user_request = f"Generate a short story about {topic}."

    seed = 12345

    response1 = get_chat_response(
        system_message=system_message, user_request=user_request, seed=seed,
    )

    response2 = get_chat_response(
        system_message=system_message, user_request=user_request, seed=seed,
    )

    compare_responses(response1, response2)


if __name__ == "__main__":
    main()



I can confirm that not only do seeds not work, but setting the temperature to 0 isn't producing deterministic results either, so there may be a deeper issue affecting generations.


According to the docs, the seed param is designed for "Reproducible outputs", but it does not seem to be working as documented.

I have the same understanding as you.
I opened an issue on their Python SDK repository (openai-python/issues/708), although the issue is on the server side.


I just checked their cookbook “deterministic_outputs_with_the_seed_parameter” again, and it mentions that

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of computers.

I guess the behavior is expected then :man_shrugging:
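Per that quoted guidance, outputs are only expected to (mostly) match when every request returned the same system_fingerprint, so it's worth checking that before diffing outputs. A minimal sketch (the helper name is mine, not from any SDK):

```python
def same_fingerprint(*fingerprints: str) -> bool:
    """Return True only if every request returned the same system_fingerprint.

    Comparing completions across different fingerprints is meaningless,
    since the backend configuration changed between requests.
    """
    return len(set(fingerprints)) == 1
```

In the script above you would collect `response.system_fingerprint` from each call and skip the diff when this returns False.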

I will admit, I am somewhat confused by the “non-determinism of computers” part.

Ask in the Lounge and I can hopefully answer all of your questions on determinism and non-determinism.

After fiddling around with it I was only able to get the seed parameter to work with a single model: gpt-3.5-turbo-1106

Seeds do not appear to have any effect on any other new or old models.

Additionally, gpt-4-1106-preview will not behave deterministically even with the temperature set to 0, which definitely seems to be a bug.
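For anyone reproducing this test, the request parameters people typically pin down when chasing determinism can be collected in one place. A sketch with illustrative values (the model and seed here are just the ones used earlier in this thread, not a recommendation):

```python
# Parameters commonly fixed when trying to make completions as
# repeatable as possible -- still not a guarantee of determinism.
deterministic_params = dict(
    model="gpt-3.5-turbo-1106",
    seed=12345,
    temperature=0,  # pick the most likely token at each step
    top_p=1,        # leave nucleus sampling effectively disabled
)
```

These would be passed through to `client.chat.completions.create(**deterministic_params, messages=...)`.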


Hi, can you share more information about how to ask in the Lounge?

I tried gpt-3.5-turbo-1106 several times and could not get exactly the same result (the system_fingerprint was the same across all calls). In my tests, some parts, like the first or maybe the second line, are the same, but the rest of the result is different.
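To quantify "the first line or two match, the rest differs", difflib's SequenceMatcher gives a single similarity ratio instead of a line-by-line diff, which is easier to track across repeated runs. A small sketch:

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 ratio of how similar two completions are.

    1.0 means identical text; values near 0.0 mean almost nothing matches.
    """
    return difflib.SequenceMatcher(None, a, b).ratio()
```

Logging this ratio for each pair of same-seed responses makes "partially reproducible" measurable rather than anecdotal.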

The Lounge is a category that is accessible once you reach Trust Level 3.

The default requirements are listed here

Why are you redirecting people to Discourse instead of answering here?

I am not redirecting people, only noting to Foxabilo that he should ask in the Lounge category, which he did.


Agree, it definitely doesn’t work. After many tries gpt-3.5-turbo-1106 produces different results on each call.

I’ve figured out that the issue with not reproducing the same output for a defined seed is related to calculating question embeddings, which tend to produce varying results. So for testing purposes, to achieve comparable results, you need to retrieve the embeddings once and then reuse (mock) them for a defined set of test questions. More about this issue can be found in the thread below:
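A minimal way to "retrieve once and reuse" embeddings in tests is a cache in front of the embedding call, so every run of the test suite compares against identical vectors. A sketch under stated assumptions: `embed()` here is a deterministic stand-in for the real embeddings request (e.g. `client.embeddings.create`), not the actual API:

```python
def embed(text: str) -> list[float]:
    # Stand-in for a real embeddings call; a deterministic toy
    # encoding so this sketch is self-contained and runnable.
    return [float(ord(c)) for c in text]


_embedding_cache: dict[str, tuple[float, ...]] = {}


def cached_embedding(question: str) -> tuple[float, ...]:
    """Retrieve the embedding once, then reuse the stored vector.

    Repeated test runs with the same question therefore compare
    against byte-identical embeddings instead of fresh API results.
    """
    if question not in _embedding_cache:
        _embedding_cache[question] = tuple(embed(question))
    return _embedding_cache[question]
```

In a real test suite you would populate the cache from a fixture file rather than calling the API at all.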