I fine-tuned GPT-4o and fixed a seed for the job (seed=123). This is my code for the Chat Completions call:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model=model,  # the fine-tuned model ID
    temperature=0.0000001,
    top_p=0.1,
    seed=123,
    messages=[
        {…}  # prompt messages elided
    ],
)
I run the model multiple times, but each time I get slightly different output. My understanding was that by setting the seed during fine-tuning I could guarantee reproducibility, but that was not the case! Any explanation?
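For reference, this is roughly how I test it (the prompt here is just a placeholder):

# Run the same request several times and collect the distinct outputs.
outputs = set()
for _ in range(5):
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0000001,
        top_p=0.1,
        seed=123,
        messages=[{"role": "user", "content": "Translate 'New game' to French."}],
    )
    outputs.add(completion.choices[0].message.content)
print(f"{len(outputs)} distinct outputs out of 5 runs")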
(Un?)fortunately, nothing is really deterministic anymore. Temperature and top_p of 0 can get you pretty far, but it also depends on your prompt to a degree. If you engineer your prompt such that the answer is obvious, you’re more likely to get “deterministic” behavior. If you created the equivalent of a slot machine with your prompt, even turning off the sampler won’t save you.
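Also worth knowing: the seed parameter is documented as best-effort only. Each response carries a system_fingerprint field, and if that value changes between runs, the backend configuration changed and outputs can differ even with a fixed seed. A quick sketch to check, reusing the client and model from the question (the prompt is illustrative):

# If system_fingerprint differs between runs, backend changes rather than
# sampling explain the divergence.
for i in range(3):
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        seed=123,
        messages=[{"role": "user", "content": "Translate 'Game over' to French."}],
    )
    print(i, completion.system_fingerprint, completion.choices[0].message.content)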
When I can, I use response variability between models (not logprobs) as a measure of prompt stability. I measure the final workflow output, though, not the CoT steps.
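A minimal sketch of that idea, using difflib for a crude pairwise similarity score (the model list and prompt are placeholders):

import difflib
from itertools import combinations
from openai import OpenAI

client = OpenAI()
models = ["gpt-4o", "gpt-4o-mini"]  # placeholder model list
prompt = [{"role": "user", "content": "..."}]  # the final workflow prompt

outputs = {
    m: client.chat.completions.create(model=m, temperature=0, messages=prompt)
    .choices[0]
    .message.content
    for m in models
}
# Higher pairwise similarity across models suggests a more stable prompt.
for a, b in combinations(models, 2):
    ratio = difflib.SequenceMatcher(None, outputs[a], outputs[b]).ratio()
    print(f"{a} vs {b}: {ratio:.2f}")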
Well, my task is translation from a source to a target language in a gaming context. I designed the system prompt to ensure the output provides high-quality translations that are accurate, contextually aware, and preserve the original tone. It also ensures cultural nuances are adapted effectively to suit the target audience.
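Something along these lines (an illustrative sketch with placeholder languages, not the exact prompt):

system_prompt = (
    "You are an expert game localizer translating from English to French. "
    "Produce accurate, contextually aware translations that preserve the "
    "original tone, and adapt cultural references so they feel natural to "
    "the target audience."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Your save file is corrupted."},
]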
Translation is by nature a fairly uncertain task. Give the same assignment to two different human translators and you’ll get divergence within a few words, if not at the very first one. An AI model, however, can show you why statistically rather than by intuition:
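For instance, the API can return token log probabilities, which reveal the positions where alternatives ranked close enough to flip (a sketch with an illustrative prompt):

completion = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    logprobs=True,
    top_logprobs=5,  # return the five most likely alternatives per position
    messages=[{"role": "user", "content": "Translate 'Press any key' to German."}],
)
# Positions where the runner-up logprob is close to the chosen token's are
# the ones most likely to flip between runs.
for tok in completion.choices[0].logprobs.content:
    alts = ", ".join(f"{t.token!r}:{t.logprob:.2f}" for t in tok.top_logprobs)
    print(f"{tok.token!r} -> {alts}")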
A very useful tool not (yet? please) available from OpenAI would be assistant response continuation and completion. When producing a particular piece of work or building training data, a native speaker could spot the highlighted token positions with high uncertainty or closely ranked alternatives (or simply places where the writing turns awkward), edit them manually or pick from the token alternates, even ones close enough to flip despite attempts at determinism, and send the AI back on track from that point, still producing the whole piece far faster in total than one could write it oneself.