All AI models that OpenAI currently runs are non-deterministic. You give them the same input, they return different logprob values each time.
This variation is even higher in the newest models. You can specify either 0 or miniscule small values for top_p and temperature, and get answers that diverge pretty quickly.
Temperature 0 is not as good as temperature 0.0000000001 for some reason. top_p at an extremely low number is a stronger enforcer of only getting the top value back.
The answer overall though is perplexity. The less expensive AI is less clear and certain how to score tokens (unless it is a particular post-training chat behavior), and so the values of logprobs end up being closer together and easy for one to overtake another between runs.
Asking for logprobs doesn’t change the behavior. It does let you see how close “Yes” was to “No”, though, or to “yes” or “I’m sorry”. That can give you insight that you need to make your prompting and desired output clearer, or that the AI just has no good truthfulness score for you from the facts.