I’m noticing that when I run the same prompt multiple times with GPT-4o-2024-08-06, the outputs aren’t always identical, even though I’ve set the temperature to 0. Is there anything more I can do to make the output fully consistent from run to run?
You can also set top_p to zero, but it likely wouldn’t make that much of a difference.
The outputs aren’t guaranteed to be deterministic. The best way to get “deterministic” outputs (on top of disabling the sampler with temperature and top_p at 0) is to structure your prompt so that a particular output is all but guaranteed.
e.g.:

Prompt: What do you call a group of fish?

Response: A group of fish is typically called a school or a shoal.

School refers to a group of fish swimming in a coordinated, synchronized way, usually of the same species, moving together for protection or to appear larger to predators.

Shoal is a more general term that applies to a group of fish swimming together loosely, without necessarily moving in the same direction or coordinating closely.

vs.:

Prompt: What do you call a group of fish? Respond only with one word, no quotes, no period.

Response: School
I know it’s not always possible, but if you can structure your prompts this way, you can achieve a decent level of determinism even across models and vendors of similar capability.
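Here’s a minimal sketch of that setup, assuming the official openai Python package and an OPENAI_API_KEY in the environment (the trial count of 5 is just illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = ("What do you call a group of fish? "
          "Respond only with one word, no quotes, no period.")

outputs = set()
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # disable temperature sampling
        top_p=1e-9,     # "extremely small" stands in for 0; either effectively forces the top token
    )
    outputs.add(resp.choices[0].message.content)

print(outputs)  # a single element means all runs agreed, e.g. {'School'}
```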
Getting identical results on every run isn’t really what AI is about, but the exercise does show you how far the outputs diverge from the single algorithmically top-ranked sequence underneath.
The other thing you can do, besides setting top_p to zero or extremely small and sending a fixed seed value (which, once top_p already forces the top token every time, should have no further effect), is to look at the system_fingerprint returned with each response, which is supposed to surface model or architecture differences that affect determinism.
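As a sketch (same client and prompt as above; the seed value 12345 is arbitrary), the seed goes in as a request parameter and the fingerprint comes back on the response object:

```python
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    top_p=1e-9,
    seed=12345,  # fixed seed: a best-effort determinism hint, not a guarantee
)
print(resp.system_fingerprint)  # should change when the serving backend changes
```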
Running 30 trials of 2024-08-06 just now, mixing requests with and without prediction, I see a bunch of distinct fingerprints, enough that keeping only one of them and discarding the rest for analysis would be costly.
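If you want to reproduce that kind of count yourself, a rough sketch (same client and prompt as before; the prediction parameter is omitted here for brevity) is just to tally the fingerprints over the batch:

```python
from collections import Counter

fingerprints = Counter()
for _ in range(30):  # 30 trials, matching the test above
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=12345,
    )
    fingerprints[resp.system_fingerprint] += 1

print(fingerprints)  # more than one key means the batch hit mixed backends
```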