GPT-4, likely being a mixture of expert models whose results are synthesized, behaves somewhat differently from the gpt-3.x models. One interpretation is that temperature is scaled differently within specializations, or that some components still perform multinomial selection.
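For intuition, here is a minimal sketch of temperature-scaled softmax sampling followed by a multinomial draw (not GPT-4's actual sampler; the logits are invented): at temperature 0.2, even a small logit gap lets one token dominate.

import numpy as np

rng = np.random.default_rng()

def sample_token(logits, temperature):
    # Temperature divides the logits before softmax: low values sharpen
    # the distribution toward the top token, high values flatten it.
    scaled = np.array(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Multinomial draw over the rescaled distribution.
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.7]  # hypothetical logits: index 0 = 'tails', 1 = 'heads'
flips = [sample_token(logits, temperature=0.2) for _ in range(60)]
print(''.join('th'[i] for i in flips))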
The assertion that the tool is broken because outputs still vary even at very low temperature is specious.
Let’s play with probabilities:
import openai  # pre-1.0 openai-python library

params = {
    "model": "gpt-4", "max_tokens": 1, "n": 60,
    "temperature": 0.2, "top_p": 0.99,
    "messages": [{"role": "system",
                  "content": """Allowed output: only one word, a random choice of 'heads' or 'tails'.
Flip a virtual coin with equal outcome probability."""}],
}
api = openai.ChatCompletion.create(**params)
# Take the first character of each of the 60 completions: 'h' or 't'.
flips = ''.join(choice["message"]["content"][0] for choice in api["choices"])
print(flips)
Sixty completions of an identical gpt-4 request, on a task far more uncertain than writing code. The results (h = heads, t = tails):
ttttttttttttttttttthtttttttttttthtttthtththttttttttttttthttt
(it is actually quite hard to prompt for equal probabilities without getting token logits back)
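If you want to actually see the probabilities, the legacy completions endpoint does return logprobs; a sketch, assuming the pre-1.0 library and a completions-capable model such as gpt-3.5-turbo-instruct (gpt-4 is not served on that endpoint):

import openai

resp = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Flip a virtual coin with equal outcome probability. One word, 'heads' or 'tails':",
    max_tokens=1,
    temperature=0.2,
    logprobs=5,
)
# Top candidate tokens and their log probabilities for the first position.
print(resp["choices"][0]["logprobs"]["top_logprobs"][0])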
I’m going to crank temperature up to 2, but drop top_p to 0.
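Only the two sampling parameters change; a sketch reusing the params dict from above:

# Same request as before, only the sampling parameters differ.
params["temperature"] = 2
params["top_p"] = 0
api = openai.ChatCompletion.create(**params)
print(''.join(choice["message"]["content"][0] for choice in api["choices"]))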
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
Deterministic coin flips.
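That matches how nucleus sampling is commonly implemented: top_p keeps the smallest set of tokens whose cumulative probability reaches the threshold, and at least the single most probable token is always retained, so top_p of 0 collapses the pool to that one token and temperature no longer matters. A sketch with made-up probabilities:

import numpy as np

def top_p_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability reaches
    # top_p; at least the top token is always kept, so top_p=0 is greedy.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(np.asarray(probs)[order])
    keep = order[:max(1, np.searchsorted(cumulative, top_p) + 1)]
    filtered = np.zeros_like(np.asarray(probs, dtype=float))
    filtered[keep] = np.asarray(probs)[keep]
    return filtered / filtered.sum()

probs = [0.55, 0.45]                 # hypothetical 'tails' vs 'heads'
print(top_p_filter(probs, 0.0))      # [1. 0.] -- only 'tails' survives
print(top_p_filter(probs, 0.99))     # [0.55 0.45] -- both stay in the pool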