Non-deterministic API/Playground GPT-4 responses break LangChain ReAct implementation

GPT-4, likely being a mixture of expert models whose results are synthesized, behaves a bit differently from the gpt-3.x models. One interpretation is that temperature is scaled differently within the specialized components, or that some components still perform multinomial token selection.
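For intuition, here is a minimal sketch of what multinomial selection over temperature-scaled logits looks like (assumed mechanics for illustration, not OpenAI's actual decoder). Low temperature sharpens the distribution but does not zero out the runner-up token:

import numpy as np

def sample_token(logits, temperature, rng=np.random.default_rng()):
    # stable softmax over temperature-scaled logits
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # multinomial draw rather than argmax: the minority token can still win
    return rng.choice(len(probs), p=probs)

# hypothetical logits slightly favoring 'tails'; even at temperature 0.2
# the minority token 'heads' still appears in a nontrivial fraction of draws
print(''.join('ht'[sample_token([1.0, 1.3], 0.2)] for _ in range(60)))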

The assertion that a tool breaks even at very low temperature might sound specious, so let's put it to the test.

Let’s play with probabilities:

params = {"model": "gpt-4", "max_tokens": 1, "n": 60,
    "temperature": 0.2, "top_p": 0.99,
    "messages": [{"role": "system",
    "content": """Allowed output: only one word, a random choice of 'heads' or 'tails'.
Flip a virtual coin with equal outcome probability."""}]}
api = openai.ChatCompletion.create(**params)
flips = ''.join(choice["message"]["content"][0] for choice in api["choices"])
print(flips)


Sixty identical runs of gpt-4, producing a result far more uncertain than how to write code.

We see the results (h = heads, t = tails):
ttttttttttttttttttthtttttttttttthtttthtththttttttttttttthttt
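A quick tally (using the flips string from the snippet above) makes the skew explicit:

print(flips.count('h'), flips.count('t'))  # 6 54 for the run shown above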

(it is actually quite hard to prompt for equal outcome probabilities without being able to see the token logits)
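As an aside, the legacy completions endpoint does return logprobs, so there the token distribution can be inspected directly; a sketch, assuming a completions-capable model like text-davinci-003 (chat completions did not expose logprobs at the time):

# inspect actual token probabilities instead of inferring them from samples
resp = openai.Completion.create(
    model="text-davinci-003",
    prompt="Flip a fair virtual coin. Answer with one word, heads or tails:\n",
    max_tokens=1,
    logprobs=5,  # top-5 candidate tokens with their log probabilities
)
print(resp["choices"][0]["logprobs"]["top_logprobs"][0])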


Now I'll crank temperature up to 2, but set top_p to 0:
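In code, only the sampling parameters change (a sketch reusing params from above; the three rows below suggest three repeated calls):

params.update({"temperature": 2.0, "top_p": 0.0})  # nucleus narrowed to the single top token
api = openai.ChatCompletion.create(**params)
print(''.join(choice["message"]["content"][0] for choice in api["choices"]))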
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

Deterministic coin flips: with top_p at 0, the nucleus is cut down to the single most probable token, so even temperature 2 has nothing left to randomize.
