Why the API output is inconsistent even when temperature is set to 0

It took a good amount of futzing around, and the prompt is just a happenstance leftover from other things I was trying, but I have an interesting result.

If you really want to have fun with statistics, run trials on two top tokens whose logprobs match to 8 digits of accuracy!

"top_logprobs": [
 {
  " Aug": -2.4173014,
  " Oct": -2.4173014,
  " Mar": -2.440739,
  " Jan": -2.440739
 }
]
  • Aug = 8.92%
  • Oct = 8.92%
  • Jan = 8.71%
  • Mar = 8.71%
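
(Those percentages are just the exponentials of the reported logprobs; a quick sanity check in Python:)

import math

top_logprobs = {" Aug": -2.4173014, " Oct": -2.4173014,
                " Mar": -2.440739, " Jan": -2.440739}

for token, lp in top_logprobs.items():
    # probability = e^logprob
    print(f"{token!r}: {math.exp(lp):.2%}")
# ' Aug': 8.92%   ' Oct': 8.92%   ' Mar': 8.71%   ' Jan': 8.71%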

model: davinci-002
max_tokens: 1

"prompt": """In the square brackets are 1000 random ASCII characters, using 0-9a-zA-Z: [0-9a-zA-Z]{1000}.

share|improve this answer

edited"""

Let’s run 70 trials at multiple settings. Extract the first letter each time.
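
(If you want to reproduce this, here's a minimal sketch of the trial loop using the openai-python v1 client against the legacy completions endpoint; the prompt string is abbreviated and the parameter handling is just illustrative.)

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "..."  # the full prompt shown above

def first_letters(n_trials, **params):
    # run n_trials single-token completions and collect the first letter of each
    letters = []
    for _ in range(n_trials):
        resp = client.completions.create(
            model="davinci-002",
            prompt=PROMPT,
            max_tokens=1,
            **params,  # e.g. top_p=0.0892, temperature=2
        )
        letters.append(resp.choices[0].text.strip()[:1])
    return "".join(letters)

print(first_letters(70, top_p=0.0891, temperature=2))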

“top_p”: 0.0891, temperature=2
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

“top_p”: 0.0892, temperature=2
OOOAAOAAAOOOAAAAOOAOAAOOAOOAOAOOAOAAOAAOOOOAAAOAAAOAAOOOAAAAOOOOAAOOAO

So there is an exact top_p threshold at which a second token becomes allowed.
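
That threshold sits right at the top token's probability: exp(-2.4173014) ≈ 0.08916, which falls between 0.0891 and 0.0892. Assuming the usual nucleus rule (keep the smallest set of tokens whose cumulative probability reaches top_p), a quick sketch shows why one setting admits one token and the other admits two:

import math

probs = [math.exp(lp) for lp in (-2.4173014, -2.4173014, -2.440739, -2.440739)]

def nucleus_size(probs, top_p):
    # count tokens kept: stop once cumulative probability reaches top_p
    cum = 0.0
    for i, p in enumerate(probs, start=1):
        cum += p
        if cum >= top_p:
            return i
    return len(probs)

print(nucleus_size(probs, 0.0891))  # 1 -> a single token survives (the all-O run)
print(nucleus_size(probs, 0.0892))  # 2 -> both tied tokens survive (the A/O mix)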

Let’s continue:

“top_p”: 0.0892, temperature=0.000000001 (mostly A)
OAAAAAAAAAAAAOAAAAOAAAAAAAAAAAAAAAOAAOAAOAOAAAAAAAAAOOAAAAOOOOAAOAOAAA

“top_p”: 0.0892, temperature=0.0000000001 (all A)
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

And, believe it or not, switching from that minuscule temperature to exactly 0 changes the result:

First letter results of “top_p”: 0.0892, temperature=0.0
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

And if we then lift the top_p restriction entirely, it changes again:

First letter results of “top_p”: 1.0, temperature=0.0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The odd thing is that the temperature-limit and top_p-limit methods converge on different tokens of the two allowed, depending on the setting.

Are they literally tied as far as top_p is concerned, so the first one seen is picked, while temperature is able to put distance between the probabilities?
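
Here's a toy numpy illustration of that guess; the float values and the tie-breaking behavior are pure assumptions on my part, not anything the API documents:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Scenario 1: the stored logits tie exactly.
tied = np.array([-2.4173014, -2.4173014])          # say [" Oct", " Aug"]
print(np.argmax(tied))       # 0: an argmax (temperature 0) takes the first max it sees

# Scenario 2: they differ just beyond the 8 digits the API reports.
close = np.array([-2.417301448, -2.417301441])     # hypothetical internal values
print(softmax(close))        # ~[0.5, 0.5]: at temperature 1 they look tied
print(softmax(close / 1e-9)) # ~[0., 1.]: a tiny temperature makes the larger one certain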
