Seed param and reproducible output do not work

That is not what the fingerprint is for.

It indicates an AI model update or revision.

For applications that require highly deterministic output, being notified that OpenAI has (otherwise stealthily) updated the model and changed the type of output that may be generated is useful information to capture.

That is not what the fingerprint is for.

Thanks for the quick reply! Let me express myself a little clearer.

From official OpenAI docs:

system_fingerprint: This fingerprint represents the backend configuration that the model runs with. Can be used in conjunction with the seed request parameter to understand when backend changes have been made that might impact determinism.

I did not mean system_fingerprint as a “seed” value itself that’s fed into the model, but rather as a “metaphorical hash” of the backend system configuration itself.

i.e. We can express generation as:

prompt + seed → backend → completion

And from the above it follows that a different backend (expressed as a different system_fingerprint value) may lead to a different completion, given the same prompt-and-seed pair as input.
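In code, that relationship suggests a simple guard: key cached results by (prompt, seed) and invalidate whenever the backend fingerprint differs from the one the cached result was produced on. A minimal sketch with made-up fingerprint values, not a real API call:

```python
# Sketch: cache completions keyed by (prompt, seed), and flag when the
# backend fingerprint changes. All names and values here are illustrative.
cache = {}  # (prompt, seed) -> (system_fingerprint, completion)

def get_or_flag(prompt, seed, fingerprint, completion):
    """Return a cached completion, or 'backend-changed' if the backend moved."""
    key = (prompt, seed)
    if key in cache:
        old_fp, old_completion = cache[key]
        if old_fp != fingerprint:
            # Same (prompt, seed), different backend: the cached output
            # may no longer be reproducible. Replace it and flag the change.
            cache[key] = (fingerprint, completion)
            return "backend-changed"
        return old_completion
    cache[key] = (fingerprint, completion)
    return completion
```

The point of the flag is exactly the notification discussed above: the same inputs may now map to a different completion, with no changelog.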


Also, there is very likely some amount of randomness coming from simple floating-point race conditions in the calculation:

when completion calculation is parallelized, some computations could finish sooner than others, reordering the arithmetic and leading to tiny (~0.00…001) errors (due to finite floating-point precision).

While most of the time it won’t matter, in a tiny fraction of cases a different token would be selected, which in turn could affect all of the remaining tokens due to how the transformer architecture works.

Although I’m not sure how prevalent (if at all) this is
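The finite-precision point is easy to demonstrate outside any model: floating-point addition is not associative, so the order in which partial sums are combined changes the result. A tiny self-contained Python illustration:

```python
# Adding the same three numbers in two different orders gives two answers,
# because 1.0 is below the rounding granularity (ulp) of 1e16.
a = sum([1e16, 1.0, -1e16])    # 1.0 is absorbed into 1e16 first -> 0.0
b = sum([1e16, -1e16, 1.0])    # the big values cancel first     -> 1.0
print(a, b)  # 0.0 1.0
```

In a parallelized logit calculation, partial sums can arrive in a different order run-to-run, producing exactly this class of tiny discrepancy.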

GPT-3 models were deterministic. You put in the same input, you get exactly the same embeddings and exactly the same logit values and logprobs every time. So it is not a “transformer architecture” issue.

Math is math, and barring computational error in the processor, those bits get combined in the same way every time regardless of how complex the underlying processes are.

All OpenAI models now available are indeed non-deterministic. We don’t know why. Did they turn off ECC in the GPU for efficiency? Is it a non-homogeneous mix of hardware in the pool? Do they purposely apply “selective availability” to the outputs so that you can’t make stateful inspections of the underlying mechanisms? Whatever it is, run 20 of the same embeddings or 20 of the same chat completions, and you get different vectors and different logprobs almost every time, often resulting in position-switching of ranked tokens and ranked semantic search results.

That fingerprint changing indicates you’re going to get different results - they added training, reweighting, or inference architecture changes, so it is essentially like pointing your job at a different model, with no changelog.

The seed is part of the sampling that comes after logit calculation and softmax, which is meant to be random. You can ask the AI to roll 1d20 at temperature 1.5, and every call gets you different results because of the random token selection from all possibilities. Set the seed the same and you’d always get the same result back - except for the previously described issue that degrades the quality of the token mass that is the input to the sampler.
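The pipeline described above can be sketched in a few lines: scale the logits by temperature, softmax them into probabilities, then draw once with a seeded RNG. This is an illustrative re-implementation with made-up logit values, not OpenAI's actual sampler:

```python
import math
import random

def sample_token(logits: dict, temperature: float, seed: int) -> str:
    """Softmax over temperature-scaled logits, then one seeded random draw."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    rng = random.Random(seed)  # seeded generator -> reproducible draw
    return rng.choices(list(probs), weights=list(probs.values()))[0]

# Same logits + same seed -> same token every time (given a fixed model).
logits = {"no": 2.1, "yes": 0.6, "Yes": -8.0}
assert sample_token(logits, 1.5, seed=42) == sample_token(logits, 1.5, seed=42)
```

Repeatability here rests entirely on the logits being identical between calls, which is exactly the assumption the non-deterministic backend breaks.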


You explained it well, thank you! The 1d20 example makes sense.


Does setting the temperature to 0 for the model gpt-4o-mini-2024-07-18 work now?

A temperature of 0 would be a divide-by-zero, so it is actually treated as a placeholder for a very low temperature if sent.
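The divide-by-zero comes from where temperature enters the math: logits are divided by T before the softmax, so T = 0 is undefined, and a tiny substitute value pushes nearly all probability mass onto the top token. A sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, t):
    """Standard temperature-scaled softmax; t = 0 would raise ZeroDivisionError."""
    scaled = [v / t for v in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# At t=1 the second token keeps real mass; at a tiny t it effectively vanishes.
print(softmax_with_temperature([2.0, 1.0], 1.0))   # ~[0.731, 0.269]
print(softmax_with_temperature([2.0, 1.0], 0.01))  # ~[1.0, ~0.0]
```

This is why a very low temperature serves as the stand-in for "0": the distribution collapses toward argmax without the undefined division.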

Trial 1, gpt-4o-mini:

USER

yes or no: Is a cashew apple actually a berry?

Enums

Token /Bias: {'yes': 0, 'no': 0}
Token#/Bias: {6763: 0, 1750: 0}
response_token[1750]

Response

RESPONSE content: {"answer":"no"}
RESPONSE token number(s): [1750]

Logprobs (to probability):
Token: "no"
Probability: 81.757379736705829%

Top Logprobs:
Token: "no"
Probability: 81.757379736705829%

Token: "yes"
Probability: 18.242537233967166%

Token: "Yes"
Probability: 0.000028339784657%

Token: " yes"
Probability: 0.000028339784657%

Trial 2:

USER

yes or no: Is a cashew apple actually a berry?

Enums

Token /Bias: {'yes': 0, 'no': 0}
Token#/Bias: {6763: 0, 1750: 0}
response_token[1750]

Response

RESPONSE content: {"answer":"no"}
RESPONSE token number(s): [1750]

Logprobs:
Token: "no"
Probability: 73.105753266954238%

Top Logprobs:
Token: "no"
Probability: 73.105753266954238%

Token: "yes"
Probability: 26.894102055251025%

Token: "Yes"
Probability: 0.000047342944659%

Token: " yes"
Probability: 0.000047342944659%


Conclusion: with token probabilities changing from 81% to 73% between calls, temperature is irrelevant for obtaining deterministic outputs. The underlying model is not deterministic.

What temperature would do in that case is shift the odds: instead of around 20 percent of the trials giving a different answer, you’d get the top answer with much higher probability - unless the logprobs vary, which they do.

Find a question where between calls you get 45% and 55% for “yes”, or just that kind of uncertainty throughout generation, and you’ll get rank-flipping and a different answer.

Did you set both temperature and seed?

That’s a good question: here’s why seed doesn’t improve the situation.

If you want guaranteed greedy sampling, where only the “best” (highest-probability) token is chosen, you’d use top_p: 0 - or, essentially, 0.000005 or lower serves the same role. Regardless of how tokens are generated, the top-ranked token cannot fall under that normalized value, so the top token will always be returned.
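The top_p mechanism sorts tokens by probability and keeps the smallest leading set whose cumulative mass reaches top_p; with a near-zero top_p, the top-ranked token alone already exceeds the threshold, so sampling degenerates to greedy. An illustrative sketch with made-up probabilities:

```python
def nucleus_filter(probs: dict, top_p: float) -> list:
    """Keep the smallest set of top-ranked tokens with cumulative mass >= top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"no": 0.82, "yes": 0.18, "Yes": 0.0000003}
assert nucleus_filter(probs, 0.000005) == ["no"]        # effectively greedy
assert nucleus_filter(probs, 0.9) == ["no", "yes"]      # nucleus keeps two
```

Note that "greedy" here still only guarantees the top-ranked token of that call - if the ranking itself flips between calls, as described next, the output still changes.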

However, with a non-deterministic underlying model, you can have two top tokens of similar value in the sorted generation that, in successive runs, might return:

Trial one: “Sure”: 6%, “Okay”: 4%
Trial two: “Okay”: 5%, “Sure”: 4.5%

They are switching places with a 6% changing to 4.5% and the 5% at the “top” now assigned to what was the second-place token in the previous generation. That is the issue. The certainties are moving around on you, and the model is generating different values each time.

A seed is used when you do allow the randomness of a non-zero top_p or temperature. It is supposed to allow repeatability, even when you allow the “creative” random sampling to be based on the distribution that was generated by the model without tweaks.


Seed: So take the same token values I just showed (where the remaining 90% covers many other tokens, like “Certainly”, that could potentially be sampled and written). A seed would ensure the random generator has the same output number - used for token selection.

From a value 0.0 to 1.0 within the distribution, seed re-use might give you 0.04 every time, and in a deterministic token dictionary from the same input, that would always give you a repeatable output.

In this case, however, picking from the sorted, ranked list of tokens, the same point of selection lands on a different token on each call if we “pick” what appears at 0.04. There are thousands of tokens, each taking up slightly more or less distribution space each time, so regardless of whether you sample at the same “point” by using a seed, what you get can differ at every token. Then it only takes one token flip for the rest of the answer to diverge.
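To make that concrete: walk two slightly different sorted distributions to the same seeded "point", and the chosen token can flip. A sketch with made-up numbers in the spirit of the 6%/4% example above:

```python
def pick_at_point(ranked_probs: list, point: float) -> str:
    """Walk the sorted distribution and return the token covering `point`."""
    cumulative = 0.0
    for token, p in ranked_probs:
        cumulative += p
        if point < cumulative:
            return token
    return ranked_probs[-1][0]  # fallback at the tail

# Same seeded selection point of 0.04, two runs with slightly shifted masses:
run1 = [("Sure", 0.06), ("Okay", 0.04)]   # 0.04 falls inside "Sure"
run2 = [("Okay", 0.045), ("Sure", 0.04)]  # 0.04 now falls inside "Okay"
assert pick_at_point(run1, 0.04) == "Sure"
assert pick_at_point(run2, 0.04) == "Okay"
```

Same seed, same selection point, different token - because the distribution underneath moved.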

Does setting a seed with temperature=0.0 consistently give repeatable outputs?


from litellm import completion

response = completion(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "tell a random number with 4 decimal digits"}
    ],
    seed=42,
    temperature=0.0,
)

print(response.choices[0].message.content, response.system_fingerprint)

# With Seed
# Sure! Here's a random number with four decimal digits: 7.4823. fp_34a54ae93c
# Sure! Here's a random number with four decimal digits: 7.4823. fp_34a54ae93c

# Without Seed
# Sure! Here's a random number with four decimal digits: 7.4823. fp_34a54ae93c
# Sure! Here's a random number with four decimal digits: **3.5821**. fp_34a54ae93c

Nope:

# --- API Parameters (except seed) ---
api_base_parameters = {
    "messages": [system_message, user_message],
    "model": model,  # model = "gpt-4o-mini-2024-07-18"
    "max_completion_tokens": 30,
    "top_p": 1,
    "temperature": 0,
    "logprobs": True,
    "top_logprobs": 4,
    "logit_bias": {},  # No bias
}

Then trying my ambiguous question against:

seed = random.randint(0, 2**31 - 1)

And running batches of trials with the same seed; a trial run from batch 6:

*** NON-DETERMINISTIC FAULT DETECTED ***
Seed: 1346547828

— Call 1 (Trial 1) —
Answer: no
Token: "no"
Probability: 81.757371208747003%
Top Logprobs:
Token: "no"
Probability: 81.757371208747003%
Token: "yes"
Probability: 18.242535059287391%
Token: " yes"
Probability: 0.000032113183144%
Full JSON response: {"answer":"no"}

— Call 2 (Trial 2) —
Answer: yes
Token: "yes"
Probability: 62.245838378760276%
Top Logprobs:
Token: "yes"
Probability: 62.245838378760276%
Token: "no"
Probability: 37.754008291078286%
Token: " yes"
Probability: 0.000066460153032%
Full JSON response: {"answer":"yes"}

Replication instructions:
Use seed=1346547828 and the same prompt/model to attempt to reproduce this result.

"no" went from 81% to 38% probability with the same input between runs, with temperature=0 and a seed. Reusing the same seed sample “point” cannot help.