Hi,
I have written code according to the docs to check whether I can get the same output by setting the “seed” param, but the output still differs between requests. Both the “gpt-4-1106-preview” and “gpt-3.5-turbo” models give unreproducible results even when all of the input and the seed are identical.
Am I misunderstanding how the seed param is supposed to be used?
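For reference, this is roughly what I'm doing (the prompt and seed value are just placeholders; it assumes the v1 Python SDK and an OPENAI_API_KEY in the environment). I run it twice and diff the printed output:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single request with a fixed seed and temperature 0; the docs say repeating
# this with identical parameters should "mostly" give identical outputs.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Tell me a short story about a robot."}],
    seed=12345,
    temperature=0,
)

print(response.system_fingerprint)
print(response.choices[0].message.content)
```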
I can confirm that not only do seeds not work, but setting the temperature to 0 doesn’t produce deterministic results either, so there may be a deeper issue affecting generations.
I have the same understanding as you.
I opened an issue on their Python SDK repository (openai-python/issues/708), although the underlying problem is on the server side.
I just checked their cookbook “deterministic_outputs_with_the_seed_parameter” again, and it mentions that:
If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of computers.
I tried gpt-3.5-turbo-1106 several times and could not get exactly the same result (the system_fingerprint was the same every time). In my test code, some parts, like the first line or maybe the second, come out identical, but the rest of the result differs.
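To make the comparison concrete, this is the kind of loop I'm running (the prompt and seed are illustrative):

```python
from openai import OpenAI

client = OpenAI()

outputs, fingerprints = [], []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": "Explain what a mutex is in one paragraph."}],
        seed=42,
        temperature=0,
    )
    outputs.append(response.choices[0].message.content)
    fingerprints.append(response.system_fingerprint)

# In my runs all fingerprints match, yet the completions still diverge
# after the first line or two.
print("distinct fingerprints:", len(set(fingerprints)))
print("distinct outputs:     ", len(set(outputs)))
```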
I’ve figured out that the issue with not reproducing the same output for a defined seed is related to calculating question embeddings, which tend to produce varying results. So for testing purposes, to get comparable results, you need to retrieve the embeddings once and then reuse (mock) them for a defined set of test questions. You can find more about this issue in the thread below:
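As a minimal sketch of that cache-and-reuse idea (the cache file name and the embedding model are just examples I picked, not anything prescribed by the docs):

```python
import json
import os

from openai import OpenAI

client = OpenAI()
CACHE_FILE = "embedding_cache.json"  # illustrative location for cached embeddings

def get_embedding(text: str) -> list[float]:
    """Fetch an embedding once, then reuse the stored value on every later call."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if text not in cache:
        response = client.embeddings.create(model="text-embedding-ada-002", input=text)
        cache[text] = response.data[0].embedding
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
    return cache[text]
```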
Seed still does not guarantee the same answer.
If you look at the guide, it only says they will do their best.
Is it because it’s still a beta service?
Are there any plans for further improvements?
In my experience, the seeds “work by chance”. Out of 5 requests, I get 3 that are identical (not to mention the temperature is low, 0.2, though that should not matter when you set the seed) and 2 that come back different.
I think this seed feature still needs improvement/fixes/patches.
However, at least in all my use cases, the responses are still nondeterministic. Although the fingerprints match, the responses diverge after the first few words. So setting a seed value is useless for unit tests and reproducing bugs, because the deterministic part simply doesn’t work.
If anyone finds a way to make the openai responses fully deterministic, please let us know!
When you send prompt + seed to the API, OpenAI may not compute the answer with the same system as before. This is reported in the system_fingerprint field of the response. A different system_fingerprint means a different backend configuration was used, which means you may get a different completion even with the same seed.
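So if you do want to compare runs, e.g. in a test, it seems safest to treat only responses with matching fingerprints as comparable. A rough pytest-style sketch (the prompt, model, and seed are placeholders, not anything from the docs):

```python
import pytest
from openai import OpenAI

client = OpenAI()

def ask(prompt: str):
    return client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
        seed=123,
        temperature=0,
    )

def test_seeded_completion_is_reproducible():
    first = ask("Summarise the plot of Hamlet in one sentence.")
    second = ask("Summarise the plot of Hamlet in one sentence.")
    if first.system_fingerprint != second.system_fingerprint:
        # Different backend configuration: outputs are not expected to match.
        pytest.skip("system_fingerprint changed between requests")
    assert first.choices[0].message.content == second.choices[0].message.content
```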