I don’t work for OpenAI and didn’t program the models; all we can do is extract evidence from the outside.
The randomness is evidently not reset to a fixed state between API calls: doing so would defeat the purpose of sampling for diverse answers and would just guarantee the same low-perplexity response every time.
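To see why, here is a toy sketch in plain Python (not the API’s actual sampler; the vocabulary and weights are made up) showing what resetting the RNG to a fixed state per call would do:

```python
import random

VOCAB = ["yes", "no", "maybe"]
WEIGHTS = [0.40, 0.39, 0.21]  # hypothetical distribution with near-tied top-2

def sample_reply(length=5, reset_seed=None):
    """Simulate one API call: draw `length` tokens from the distribution.
    If reset_seed is given, the RNG is reset to that fixed state first."""
    rng = random.Random(reset_seed) if reset_seed is not None else random
    return [rng.choices(VOCAB, weights=WEIGHTS)[0] for _ in range(length)]

# Resetting to a fixed state per call -> identical output every time,
# even though two candidates have almost equal probability.
a = sample_reply(reset_seed=1234)
b = sample_reply(reset_seed=1234)
print(a == b)  # True

# Without the reset, the RNG state keeps advancing between calls,
# so repeated calls can (and in practice do) produce different tokens.
```

With the per-call reset, sampling collapses to a deterministic sequence, which is exactly the behavior the API does not exhibit.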
As one example, you can replicate this prompt and the next token it produces (the 46th context element) to probe a case where the two top probabilities are almost identical:
More:
The latter gives you sample chat-endpoint code and gpt-4 results, and we also have the newer gpt-3.5-turbo-instruct completion model, which takes raw input and returns logprobs to experiment with.
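A minimal sketch of such a completions request, asking for the top token logprobs; the prompt string here is only a placeholder for the probe prompt above, and the network call is left commented out since it needs an API key:

```python
import json
import os  # used in the commented-out request below

# Request body for the legacy completions endpoint.
payload = {
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Your probe prompt here",  # substitute the probe prompt
    "max_tokens": 1,       # we only care about the next token
    "temperature": 0,      # greedy pick; ties expose the near-identical case
    "logprobs": 5,         # return the top-5 token logprobs per position
}

print(json.dumps(payload, indent=2))

# To actually send it:
# import requests
# r = requests.post(
#     "https://api.openai.com/v1/completions",
#     headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
#     json=payload,
# )
# print(r.json()["choices"][0]["logprobs"]["top_logprobs"][0])
```

Running the same payload repeatedly at temperature 0 and inspecting `top_logprobs` is one way to catch two candidates whose probabilities are nearly identical.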