Seed param and reproducible output do not work

Hi,
I wrote the code below, following the docs, to check whether I can get the same output by setting the “seed” param, but the output still differs between requests. Both the “gpt-4-1106-preview” and “gpt-3.5-turbo” models give unreproducible results even when the input and seed are identical across requests.
Am I misunderstanding how the seed param is supposed to be used?

from openai import OpenAI
import difflib

# GPT_MODEL = "gpt-4-1106-preview"
GPT_MODEL = "gpt-3.5-turbo"
client = OpenAI(api_key='■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■U47cWLiN')


def get_chat_response(system_message: str, user_request: str, seed: int | None = None):
    """Request a single chat completion and print its content and metadata."""
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_request},
    ]

    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=messages,
        seed=seed,
        temperature=0.7
    )

    # print(response)

    response_content = response.choices[0].message.content
    system_fingerprint = response.system_fingerprint
    prompt_tokens = response.usage.prompt_tokens
    # The SDK exposes this directly; equivalent to total_tokens - prompt_tokens.
    completion_tokens = response.usage.completion_tokens

    print(response_content + "\n")
    print(f"system_fingerprint:{system_fingerprint}\n")
    print(f"prompt_tokens:{prompt_tokens}\n")
    print(f"completion_tokens:{completion_tokens}\n")
    print("---------\n")
    return response_content


def compare_responses(previous_response: str, response: str):
    """Print a line-by-line diff between two responses."""
    diff = difflib.Differ().compare(previous_response.splitlines(), response.splitlines())
    print('\n'.join(diff), end="")


def main():
    topic = "a happy journey to Mars"
    system_message = "You are a helpful assistant that generates short stories."
    user_request = f"Generate a short story about {topic}."

    seed = 12345

    response1 = get_chat_response(
        system_message=system_message, user_request=user_request, seed=seed,
    )

    response2 = get_chat_response(
        system_message=system_message, user_request=user_request, seed=seed,
    )

    compare_responses(response1, response2)


if __name__ == "__main__":
    main()

5 Likes

I can confirm that not only do seeds not work, but setting the temperature to 0 isn’t producing deterministic results either, so there may be a deeper issue affecting generations.

3 Likes

According to the docs, the seed param is designed for “Reproducible outputs”, but it does not seem to be working as the docs describe.

3 Likes

I have the same understanding as you.
I opened an issue on their Python SDK repository (openai-python/issues/708), although the issue is on the server side.

2 Likes

I just checked their cookbook “deterministic_outputs_with_the_seed_parameter” again, and it mentions:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of computers.

I guess the behavior is expected then :man_shrugging:
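In practice, the quoted caveat suggests only comparing two completions when their system_fingerprint values agree. Here is a minimal sketch of that check (the model, prompt, and seed are placeholders, and it assumes OPENAI_API_KEY is set in the environment):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def seeded_pair_matches(model: str, messages: list, seed: int) -> None:
    """Request the same seeded completion twice; only compare the texts
    when the system_fingerprint values match, per the cookbook's caveat."""
    results = []
    for _ in range(2):
        resp = client.chat.completions.create(
            model=model, messages=messages, seed=seed, temperature=0.7
        )
        results.append((resp.system_fingerprint, resp.choices[0].message.content))

    (fp1, text1), (fp2, text2) = results
    if fp1 != fp2:
        # Different backend configuration: divergence is expected here,
        # so this pair says nothing about whether seeding works.
        print(f"fingerprints differ ({fp1} vs {fp2}); comparison is inconclusive")
    else:
        print("identical" if text1 == text2 else "diverged despite matching fingerprint")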

1 Like

I will admit, I am somewhat confused by the “non-determinism of computers” part.

2 Likes

After fiddling around with it, I was only able to get the seed parameter to work with a single model: gpt-3.5-turbo-1106

Seeds do not appear to have any effect on any other new or old models.

Additionally, gpt-4-1106-preview will not behave deterministically even with the temperature set to 0, which definitely seems to be a bug.
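For what it’s worth, here is a small probe of that claim (a sketch; the prompt and seed are arbitrary, and it assumes OPENAI_API_KEY is set): it runs the same temperature-0, seeded request twice per model and reports which ones reproduce their own output.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MESSAGES = [{"role": "user", "content": "Say one sentence about Mars."}]

for model in ("gpt-3.5-turbo-1106", "gpt-4-1106-preview"):
    # Two identical requests: deterministic behavior means identical texts.
    texts = [
        client.chat.completions.create(
            model=model, messages=MESSAGES, seed=42, temperature=0
        ).choices[0].message.content
        for _ in range(2)
    ]
    print(model, "deterministic" if texts[0] == texts[1] else "nondeterministic")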

2 Likes

Hi, can you share more information about how to ask in the Lounge?

I tried gpt-3.5-turbo-1106 several times and could not get exactly the same result (the system_fingerprint was the same across all runs). In my tests, some parts, such as the first line or maybe the second, come out the same, but the rest of the result is different.
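To quantify that, something like this hypothetical helper (with two strings standing in for real completions) measures how far two responses agree before they diverge:

def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two completions share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Stand-in completions: identical first line, divergent afterwards.
r1 = "Once upon a time, a crew set off for Mars.\nThey laughed the whole way."
r2 = "Once upon a time, a crew set off for Mars.\nThe journey was joyful."
print(common_prefix_len(r1, r2))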

The Lounge is a category that is accessible once you reach Trust Level 3.

The default requirements are listed here

Why are you redirecting people to Discourse instead of answering here?

I am not redirecting people, only noting to Foxabilo that he should ask in the Lounge category, which he did.

1 Like

Agree, it definitely doesn’t work. After many tries, gpt-3.5-turbo-1106 produces different results on each call.

I’ve figured out that the issue of not reproducing the same output with a defined seed is related to calculating question embeddings, which tend to produce varying results. So for testing purposes, to achieve comparable results, you need to retrieve the embeddings once and then reuse (mock) them for a defined set of test questions. You can find more about this issue in the thread below:
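One way to do the “retrieve once, then reuse” part is a small on-disk cache. A sketch (the cache path and embedding model are my assumptions; it assumes OPENAI_API_KEY is set):

import json
import os

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
CACHE_PATH = "embedding_cache.json"  # hypothetical cache file


def cached_embedding(text: str) -> list[float]:
    """Fetch an embedding once, then replay it from a local JSON cache so
    tests compare against a fixed vector instead of a fresh API call."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    if text not in cache:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
        cache[text] = resp.data[0].embedding
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[text]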

1 Like

Seed still does not guarantee the same answer.
If you look at the guide, it only says they do their best.
Is that because it’s still a beta feature?
Are there any plans for further improvements?

2 Likes

Agree, results still tend to fluctuate. I run integration tests regularly, hoping to catch the moment when results become reproducible. Not yet…

1 Like

In my experience, the seeds “do work by chance”: out of 5 requests, I get 3 identical results and 2 different ones (and that’s with a low temperature of 0.2, though temperature should not matter once you set the seed).

I think this seed feature still needs improvement/fixes/patches.
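A quick way to put a number on “works by chance” (a sketch; the model, prompt, and request count are placeholders, and it assumes OPENAI_API_KEY is set) is to repeat the same seeded request and count how often the most common completion appears:

from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def reproducibility_rate(model: str, messages: list, seed: int, n: int = 5) -> float:
    """Send the same seeded request n times and report how many calls
    returned the single most common completion (e.g. 3 out of 5)."""
    texts = [
        client.chat.completions.create(
            model=model, messages=messages, seed=seed, temperature=0.2
        ).choices[0].message.content
        for _ in range(n)
    ]
    _, count = Counter(texts).most_common(1)[0]
    print(f"{count}/{n} requests returned the same completion")
    return count / n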

1 Like

Any updates here? This is still an issue in Feb 2024.

1 Like

+1

I interpreted the “(mostly) deterministic” responses announced at https://platform.openai.com/docs/guides/text-generation/reproducible-outputs to mean it would give the same response except when the fingerprint changes.

However, at least in all my use cases, the responses are still nondeterministic. Although the fingerprints match, the responses diverge after the first few words. So setting a seed value is useless for unit tests and reproducing bugs, because the deterministic part simply doesn’t work.

If anyone finds a way to make the openai responses fully deterministic, please let us know!
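Until that happens, one pragmatic workaround for unit tests (a sketch, not an official mechanism; the fixture file name is made up) is to record the first response for a given request and replay it on later runs:

import hashlib
import json
import os

RECORDINGS = "recorded_responses.json"  # hypothetical fixture file


def replayable_chat(client, **kwargs) -> str:
    """Record the first completion for a given set of request arguments,
    then replay it on subsequent runs so tests see a stable string even
    though the API itself is nondeterministic."""
    key = hashlib.sha256(
        json.dumps(kwargs, sort_keys=True, default=str).encode()
    ).hexdigest()
    recordings = {}
    if os.path.exists(RECORDINGS):
        with open(RECORDINGS) as f:
            recordings = json.load(f)
    if key not in recordings:
        resp = client.chat.completions.create(**kwargs)
        recordings[key] = resp.choices[0].message.content
        with open(RECORDINGS, "w") as f:
            json.dump(recordings, f)
    return recordings[key]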

When you send prompt + seed to the API, OpenAI may not compute the answer on the same system as before. This is represented by the system_fingerprint variable: a different system_fingerprint means a different system was used, which means you might get a different completion.

e.g.

prompt1 + seed1 -> system_fingerprint1 -> completion1
prompt1 + seed1 -> system_fingerprint1 -> completion1
prompt1 + seed1 -> system_fingerprint2 -> completion2
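In code, that mapping suggests bucketing repeated requests by fingerprint; only completions within the same bucket are expected to (mostly) match. A sketch (model, prompt, and seed are placeholders; assumes OPENAI_API_KEY is set):

from collections import defaultdict

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def completions_by_fingerprint(model: str, messages: list, seed: int, n: int = 5):
    """Group repeated seeded requests by system_fingerprint and count the
    distinct completions observed within each group."""
    buckets = defaultdict(set)
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model, messages=messages, seed=seed, temperature=0
        )
        buckets[resp.system_fingerprint].add(resp.choices[0].message.content)
    for fp, texts in buckets.items():
        print(f"{fp}: {len(texts)} distinct completion(s)")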
1 Like