Nothing special is needed, just a regular sentence. For example, you can try the following input prompt:
“How did WW2 start?”
With this configuration, I get two different answers across runs.
Yes, what's happening before the generation of token probabilities is an unreliable calculation. The first token of a story might come out as "The" = 33.55% on one run and "The" = 33.21% on another. With the probabilities bouncing around like that across successive generations, even with greedy sampling the second-ranked token "A" = 33.33% (+/- x%) can jump into first place and get selected instead.
This is exactly what we see in the one 3.5 model we get logprobs from; the same symptom shows up in the rest.
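To make the mechanism concrete, here is a minimal sketch (not the actual inference stack) of how tiny run-to-run numeric jitter, modeled here as Gaussian noise on the logits, can flip the greedy argmax between near-tied tokens. The token names, logit gaps, and noise scale are all illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Three near-tied candidates for the first token, mimicking the
# ~33.3% figures above (names and gaps are made up).
tokens = ["The", "A", "It"]
logits = np.array([0.0002, 0.0001, 0.0])

rng = np.random.default_rng(0)
for run in range(5):
    # Stand-in for run-to-run numeric jitter (e.g. non-deterministic
    # GPU reduction order), modeled as tiny Gaussian noise.
    noisy = logits + rng.normal(scale=1e-4, size=logits.shape)
    probs = softmax(noisy)
    print(f"run {run}: greedy pick = {tokens[int(np.argmax(probs))]}, "
          f"probs = {np.round(probs * 100, 2)}%")
```

Even with greedy (deterministic) selection, the pick changes from run to run whenever the noise is comparable to the gap between the top candidates.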
The setup: four variants using seed/top_p and a 50/200 token limit, with 100 completions for each variation.
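For reference, here is a sketch of how such a run might look with the OpenAI Python SDK. This is my reading of the setup, not the original configuration: the model name and the exact seed/top_p values are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = [{"role": "user", "content": "How did WW2 start?"}]

# Assumed four variants: {seed, low top_p} x {50, 200 token limit}.
variants = [
    {"seed": 42, "max_tokens": 50},
    {"seed": 42, "max_tokens": 200},
    {"top_p": 0.01, "max_tokens": 50},
    {"top_p": 0.01, "max_tokens": 200},
]

results = {}
for params in variants:
    results[str(params)] = [
        client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed; substitute the model under test
            messages=PROMPT,
            **params,
        ).choices[0].message.content
        for _ in range(100)  # 100 completions per variant
    ]
```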
More variations of the experiment (both in terms of models and input data) are needed before drawing a definitive conclusion.
From the tests done so far, the seed parameter gives more stable results than a low top_p, and the longer the completion, the higher the variability. Interestingly, adding a very low top_p increased variability compared to not setting it at all.
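For completeness, one way to put a number on "variability" for a variant's 100 completions is the share of runs that deviate from the most common output. This particular metric is my own illustrative choice, not necessarily the one used above:

```python
from collections import Counter

def variability(completions):
    # Fraction of runs that deviate from the most frequent completion;
    # 0.0 means every run produced identical text.
    modal = Counter(completions).most_common(1)[0][1]
    return 1.0 - modal / len(completions)

# Toy stand-in for one variant's 100 completions.
runs = ["The war began with...", "The war began with...", "A war began..."]
print(f"{len(set(runs))} distinct, variability={variability(runs):.2f}")
```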