How can I reproduce chat completions?

Hello,

How can I make sure I am able to reproduce the exact same answer for the same input prompt?

Here are my settings:
params = {
    "model": "gpt-4-1106-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
    "seed": 33,
    "temperature": 0,
}

PROMPT_MESSAGES remains the same.
With these settings I am getting different results across several runs.
What am I missing here?
Thanks!


What output are you getting? Is it structured JSON, a sentence, etc.?

The seed parameter should be able to do the magic.


Just a regular sentence.
For example, you can try with the following input prompt:
“How did WW2 start?”
I get two different answers with this configuration.

The seed is fixed and I still get a different output every time.

Could you give an example of prompt messages where, with the same seed, you get different results on different executions?

PROMPT_MESSAGES = [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe how WW2 started."
        }
      ]
    }
  ]

params = {
    "model": "gpt-4-1106-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
    "seed": 33,
    "temperature": 0,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)
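
For what it's worth, responses that use a seed also carry a system_fingerprint field, and as I understand it, determinism is only expected while that fingerprint stays the same between runs. A minimal sketch, reusing the client and params from above, to check both at once:

def run_once():
    # Same request as above; return the text and the backend fingerprint.
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content, result.system_fingerprint

text_a, fp_a = run_once()
text_b, fp_b = run_once()
print("same system_fingerprint:", fp_a == fp_b)
print("same completion text:   ", text_a == text_b)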

Your single parameter for all the determinism the AI can offer is top_p = 0.000001 (1e-9 works just as well).

Even if every one of the 100,000 tokens had near-equal probability, the probability mass can only include the one that comes first, so random sampling is irrelevant.

Then it is up to the AI model itself to work deterministically (which it doesn't for models after GPT-3).
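
A minimal sketch of that suggestion, reusing the request from earlier in the thread (params_greedy is just an illustrative name):

params_greedy = {
    "model": "gpt-4-1106-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
    "top_p": 1e-9,  # nucleus so small that only the single top-ranked token can be sampled
}
result = client.chat.completions.create(**params_greedy)
print(result.choices[0].message.content)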

It’s very close with top_p=1e-9, but not exactly the same. Are you saying that GPT-4 has randomness inherent in the model itself?

Yes, what happens before the generation of token probabilities is unreliable calculation. The first token of a story might be “The”=33.55% on one run and “The”=33.21% on another, and with those values bouncing around from generation to generation, even greedy sampling shows symptoms: the second-ranked token “A”=33.33% (+/- x%) can take first place and be selected instead.

This is exactly what we see in the one 3.5 model we get logprobs from; the same symptom shows up in the rest.
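
If you want to watch the drift yourself, here is a rough sketch using the legacy completions endpoint, which returns per-token logprobs (gpt-3.5-turbo-instruct and the prompt are just placeholders; any model that exposes logprobs will do):

import math

def top_token(prompt):
    # Request only the first token, along with the top-5 candidate logprobs.
    r = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=1,
        temperature=0,
        logprobs=5,
    )
    candidates = r.choices[0].logprobs.top_logprobs[0]  # dict of token -> logprob
    token, logprob = max(candidates.items(), key=lambda kv: kv[1])
    return token, round(math.exp(logprob) * 100, 2)  # winning token and its probability in %

for _ in range(3):
    print(top_token("Write a one-sentence story."))  # the percentage can wobble run to run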

I did some experimenting: prompt_engineering_experiments/experiments/DeterministicResultsOpenAI/Deterministic Results in OpenAI (report).ipynb at main · TonySimonovsky/prompt_engineering_experiments · GitHub

Four variations using seed/top_p and a 50/200-token limit, with 100 completions for each variation.

More variations of the experiment (both in terms of models and input data) are required to make a definitive conclusion.

From the tests done so far we can see that the seed parameter gives more stable results than low top_p, and the longer the completion, the higher the variability.

Interestingly, adding a very low top_p increased variability compared to not having it.
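
A rough sketch of that kind of measurement, in case anyone wants to repeat it on their own prompts (n=10 here just to keep it cheap; the notebook uses 100 completions per variation):

from collections import Counter

def count_distinct_completions(params, n=10):
    # Fire the same request n times and count how many unique outputs come back.
    outputs = [
        client.chat.completions.create(**params).choices[0].message.content
        for _ in range(n)
    ]
    return Counter(outputs)

counts = count_distinct_completions(params)
print(f"{len(counts)} distinct completions out of {sum(counts.values())} runs")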


Nice tests! Are you going to re-run them with the latest API version announced yesterday?