Seed param and reproducible output do not work

Hi,
I wrote the code below, following the docs, to check whether I can get the same output by setting the “seed” param, but the output still differs between requests. Both the “gpt-4-1106-preview” and “gpt-3.5-turbo” models give unreproducible results even when the input and seed are identical across requests.
Am I misunderstanding how the seed param is supposed to be used?

from openai import OpenAI
import difflib

# GPT_MODEL = "gpt-4-1106-preview"
GPT_MODEL = "gpt-3.5-turbo"
client = OpenAI(api_key='■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■U47cWLiN')


def get_chat_response(system_message: str, user_request: str, seed: int | None = None):
    """Request a single chat completion and print its content and metadata."""
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_request},
    ]

    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=messages,
        seed=seed,
        temperature=0.7
    )

    # print(response)

    response_content = response.choices[0].message.content
    system_fingerprint = response.system_fingerprint
    prompt_tokens = response.usage.prompt_tokens
    # The SDK exposes this directly; equivalent to total_tokens - prompt_tokens.
    completion_tokens = response.usage.completion_tokens

    print(response_content + "\n")
    print(f"system_fingerprint:{system_fingerprint}\n")
    print(f"prompt_tokens:{prompt_tokens}\n")
    print(f"completion_tokens:{completion_tokens}\n")
    print("---------\n")
    return response_content


def compare_responses(previous_response: str, response: str):
    """Print a line-by-line diff between two responses."""
    diff = difflib.Differ().compare(previous_response.splitlines(), response.splitlines())
    print('\n'.join(diff), end="")


def main():
    topic = "a happy journey to Mars"
    system_message = "You are a helpful assistant that generates short stories."
    user_request = f"Generate a short story about {topic}."

    seed = 12345

    response1 = get_chat_response(
        system_message=system_message, user_request=user_request, seed=seed,
    )

    response2 = get_chat_response(
        system_message=system_message, user_request=user_request, seed=seed,
    )

    compare_responses(response1, response2)


if __name__ == "__main__":
    main()

5 Likes

I can confirm that not only do seeds not work, but setting the temperature to 0 isn’t producing deterministic results either, so there may be a deeper issue affecting generations.

3 Likes

According to the docs, the seed param is designed for “Reproducible outputs”, but it does not seem to be working as the docs describe.

3 Likes

I have the same understanding as you.
I opened an issue on their Python SDK repository (openai-python/issues/708), although the issue is on the server side.

2 Likes

I just checked their cookbook “deterministic_outputs_with_the_seed_parameter” again, and it mentions:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of computers.

I guess the behavior is expected then :man_shrugging:
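In practice, the quoted caveat suggests only comparing two completions when their system_fingerprint values agree. Here is a minimal sketch of that check (the model, prompt, and seed are placeholders, and it assumes OPENAI_API_KEY is set in the environment):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def seeded_pair_matches(model: str, messages: list, seed: int) -> None:
    """Request the same seeded completion twice; only compare the texts
    when the system_fingerprint values match, per the cookbook's caveat."""
    results = []
    for _ in range(2):
        resp = client.chat.completions.create(
            model=model, messages=messages, seed=seed, temperature=0.7
        )
        results.append((resp.system_fingerprint, resp.choices[0].message.content))

    (fp1, text1), (fp2, text2) = results
    if fp1 != fp2:
        # Different backend configuration: divergence is expected here,
        # so this pair says nothing about whether seeding works.
        print(f"fingerprints differ ({fp1} vs {fp2}); comparison is inconclusive")
    else:
        print("identical" if text1 == text2 else "diverged despite matching fingerprint")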

1 Like

I will admit, I am somewhat confused by the “non-determinism of computers” part.

2 Likes

After fiddling around with it, I was only able to get the seed parameter to work with a single model: gpt-3.5-turbo-1106

Seeds do not appear to have any effect on any other new or old models.

Additionally, gpt-4-1106-preview will not behave deterministically even with the temperature set to 0, which definitely seems to be a bug.
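For what it’s worth, here is a small probe of that claim (a sketch; the prompt and seed are arbitrary, and it assumes OPENAI_API_KEY is set): it runs the same temperature-0, seeded request twice per model and reports which ones reproduce their own output.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MESSAGES = [{"role": "user", "content": "Say one sentence about Mars."}]

for model in ("gpt-3.5-turbo-1106", "gpt-4-1106-preview"):
    # Two identical requests: deterministic behavior means identical texts.
    texts = [
        client.chat.completions.create(
            model=model, messages=MESSAGES, seed=42, temperature=0
        ).choices[0].message.content
        for _ in range(2)
    ]
    print(model, "deterministic" if texts[0] == texts[1] else "nondeterministic")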

2 Likes

Hi, can you share more information about how to ask in the Lounge?

I tried gpt-3.5-turbo-1106 several times and could not get exactly the same result (the system_fingerprint was the same across all runs). In my tests, some parts, such as the first line or maybe the second, come out the same, but the rest of the result is different.
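To quantify that, something like this hypothetical helper (with two strings standing in for real completions) measures how far two responses agree before they diverge:

def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two completions share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Stand-in completions: identical first line, divergent afterwards.
r1 = "Once upon a time, a crew set off for Mars.\nThey laughed the whole way."
r2 = "Once upon a time, a crew set off for Mars.\nThe journey was joyful."
print(common_prefix_len(r1, r2))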

The Lounge is a category that is accessible once you reach Trust Level 3.

The default requirements are listed here

Why are you redirecting people to Discourse instead of answering here?

I am not redirecting people, only noting to Foxabilo that he should ask in the Lounge category, which he did.

1 Like

Agree, it definitely doesn’t work. After many tries, gpt-3.5-turbo-1106 produces different results on each call.

I’ve figured out that the issue of not reproducing the same output with a defined seed is related to calculating question embeddings, which tend to produce varying results. So for testing purposes, to achieve comparable results, you need to retrieve the embeddings once and then reuse (mock) them for a defined set of test questions. You can find more about this issue in the thread below:
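One way to do the “retrieve once, then reuse” part is a small on-disk cache. A sketch (the cache path and embedding model are my assumptions; it assumes OPENAI_API_KEY is set):

import json
import os

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
CACHE_PATH = "embedding_cache.json"  # hypothetical cache file


def cached_embedding(text: str) -> list[float]:
    """Fetch an embedding once, then replay it from a local JSON cache so
    tests compare against a fixed vector instead of a fresh API call."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    if text not in cache:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
        cache[text] = resp.data[0].embedding
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[text]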

1 Like

Seed still does not guarantee the same answer.
If you look at the guide, it only says they do their best.
Is that because it’s still a beta feature?
Are there any plans for further improvements?

2 Likes

Agree, results still tend to fluctuate. I run integration tests regularly, hoping to catch the moment when results become reproducible. Not yet…

1 Like

In my experience, the seeds “do work by chance”: out of 5 requests, I get 3 identical results and 2 different ones (and that’s with a low temperature of 0.2, though temperature should not matter once you set the seed).

I think this seed feature still needs improvement/fixes/patches.
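A quick way to put a number on “works by chance” (a sketch; the model, prompt, and request count are placeholders, and it assumes OPENAI_API_KEY is set) is to repeat the same seeded request and count how often the most common completion appears:

from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def reproducibility_rate(model: str, messages: list, seed: int, n: int = 5) -> float:
    """Send the same seeded request n times and report how many calls
    returned the single most common completion (e.g. 3 out of 5)."""
    texts = [
        client.chat.completions.create(
            model=model, messages=messages, seed=seed, temperature=0.2
        ).choices[0].message.content
        for _ in range(n)
    ]
    _, count = Counter(texts).most_common(1)[0]
    print(f"{count}/{n} requests returned the same completion")
    return count / n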

1 Like

Any updates here? This is still an issue in Feb 2024.

1 Like

+1

I interpreted the “(mostly) deterministic” responses announced at https://platform.openai.com/docs/guides/text-generation/reproducible-outputs to mean it would give the same response except when the fingerprint changes.

However, at least in all my use cases, the responses are still nondeterministic. Although the fingerprints match, the responses diverge after the first few words. So setting a seed value is useless for unit tests and reproducing bugs, because the deterministic part simply doesn’t work.

If anyone finds a way to make the openai responses fully deterministic, please let us know!
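Until that happens, one pragmatic workaround for unit tests (a sketch, not an official mechanism; the fixture file name is made up) is to record the first response for a given request and replay it on later runs:

import hashlib
import json
import os

RECORDINGS = "recorded_responses.json"  # hypothetical fixture file


def replayable_chat(client, **kwargs) -> str:
    """Record the first completion for a given set of request arguments,
    then replay it on subsequent runs so tests see a stable string even
    though the API itself is nondeterministic."""
    key = hashlib.sha256(
        json.dumps(kwargs, sort_keys=True, default=str).encode()
    ).hexdigest()
    recordings = {}
    if os.path.exists(RECORDINGS):
        with open(RECORDINGS) as f:
            recordings = json.load(f)
    if key not in recordings:
        resp = client.chat.completions.create(**kwargs)
        recordings[key] = resp.choices[0].message.content
        with open(RECORDINGS, "w") as f:
            json.dump(recordings, f)
    return recordings[key]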

When you send prompt + seed to the API, OpenAI may not compute the answer on the same system as before. This is represented by the system_fingerprint variable: a different system_fingerprint means a different system was used, which means you might get a different completion.

e.g.

prompt1 + seed1 -> system_fingerprint1 -> completion1
prompt1 + seed1 -> system_fingerprint1 -> completion1
prompt1 + seed1 -> system_fingerprint2 -> completion2
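In code, that mapping suggests bucketing repeated requests by fingerprint; only completions within the same bucket are expected to (mostly) match. A sketch (model, prompt, and seed are placeholders; assumes OPENAI_API_KEY is set):

from collections import defaultdict

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def completions_by_fingerprint(model: str, messages: list, seed: int, n: int = 5):
    """Group repeated seeded requests by system_fingerprint and count the
    distinct completions observed within each group."""
    buckets = defaultdict(set)
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model, messages=messages, seed=seed, temperature=0
        )
        buckets[resp.system_fingerprint].add(resp.choices[0].message.content)
    for fp, texts in buckets.items():
        print(f"{fp}: {len(texts)} distinct completion(s)")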
1 Like