Logprobs and message.content are inconsistent

I am using gpt-4-turbo-2024-04-09 via the API with a prompt that asks the model to respond with either 0 or 1. My API calls include these arguments:

'max_tokens': 1,
'n': 3,
'logprobs': True,
'top_logprobs': 2

99.5% of the time, message.content (the model output) is identical to the top token by logprob (e.g., model output is 1 and top token by logprob is 1).

However, 0.5% of the time, this isn’t the case (e.g., model output is 1 and top token by logprob is 0).
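The check being described can be sketched like this — the dict below is a hypothetical stand-in for the fields the API returns per choice (`message.content` and the `top_logprobs` entries), not the SDK's actual objects:

```python
# Hypothetical per-choice data mimicking one chat completion choice:
# the sampled token (message.content) and its top_logprobs entries.
choice = {
    "content": "1",
    "top_logprobs": [
        {"token": "1", "logprob": -0.02},
        {"token": "0", "logprob": -3.90},
    ],
}

def agrees_with_top_logprob(choice):
    """True when the sampled token equals the highest-logprob token."""
    top = max(choice["top_logprobs"], key=lambda t: t["logprob"])
    return choice["content"] == top["token"]

print(agrees_with_top_logprob(choice))  # True for this example
```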

Is this discrepancy ever expected, or is it likely a bug?


The AI selects its output by random sampling based on token probabilities. This makes language more diverse, more human…and more error-prone.

80% certainty means a 20% chance that something else is written.

If you want the top-logprob token as the language output, you have to reduce the probability mass from which sampling can take place, via another API parameter:

'top_p': 0.0001

Thanks for the explanation! The API docs explain how to set top_p with this example: “0.1 means only the tokens comprising the top 10% probability mass are considered.”

Do you know what this means with regard to always getting at least one result? If I set top_p too small (e.g., 0.0001), do I risk getting back no tokens at all?

That is a good question. The top_p cutoff is inclusive. It would only “return nothing” if you could send a float smaller than the smallest magnitude representable in whatever math the endpoint performs behind the scenes. Setting it to 0 actually short-circuits to a value like 0.01, rather than to the smallest number you can send yourself, such as 1e-19.

Inclusive means that if the top probabilities were {“hello”: 25%, “hi”: 20%}, a top_p of 0.26 would include both tokens for sampling at their relative weights.
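As a toy illustration of that inclusive cutoff (not the endpoint’s actual implementation), here is a minimal nucleus-truncation sketch; the probabilities are the example values from this thread:

```python
def nucleus(probs, top_p):
    # Keep the highest-probability tokens until the cumulative
    # probability mass reaches top_p (inclusive), then stop.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    return kept

probs = {"hello": 0.25, "hi": 0.20, "hey": 0.10}
print(nucleus(probs, 0.26))    # {'hello': 0.25, 'hi': 0.2} -- both included
print(nucleus(probs, 0.0001))  # {'hello': 0.25} -- effectively greedy
```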

Since the dictionary is limited to 100k tokens, there is a minimum normalized softmax probability value that a top token can have.

Also, be mindful that the models themselves are no longer deterministic. The logprob values can vary between runs, to the point where similarly scored logits can switch rank.
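A toy illustration of nearly tied logits switching rank between runs — the Gaussian jitter here is made up for demonstration; the real run-to-run variation comes from the serving stack, not from injected noise:

```python
import math
import random

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

rng = random.Random(1)
logits = {"0": 1.300, "1": 1.295}  # nearly tied scores
ranks = set()
for _ in range(1000):
    # Perturb each logit slightly, as a stand-in for run-to-run variation.
    jittered = {t: v + rng.gauss(0, 0.01) for t, v in logits.items()}
    probs = softmax(jittered)
    ranks.add(max(probs, key=probs.get))
print(ranks)  # both tokens take the top spot across runs
```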


Thank you for that explanation!

One last question on this topic. I set the number of completion choices to 3 (i.e., n=3), and I’m finding that message.content (the actual output from the API) varies across the three choices (e.g., 0 for the first and 1 for the second).

Which makes sense.

But when I look at their top_logprobs, they’re identical (e.g., 79% for 0 and 21% for 1, for both the first and second choices).

Does this mean that what varies between choices is which token is sampled as the output, rather than the computed token probabilities?

Let’s talk about the output of a language model in overall terms, in the order of operations performed to generate a single n+1 token based on the current input context.

Here’s a convoluted abstraction I just typed up, from lots of probing and some deterministic trials that can no longer be meaningfully done on current models. Some of these interjections into the process need more flowchart branches than I depict.


context_&_AI_pretraining --> embeddings --> Language_inference
context_&_AI_pretraining --> hidden_state --> Language_inference
context_&_AI_pretraining --> json_mode --> Language_inference
context_&_AI_pretraining --> run_supervision... --> Language_inference
Language_inference -- logits_dictionary --> logit_bias -- logits_dictionary --> softmax_production

softmax_production --> top_p -- truncated_mass --> softmax_production
softmax_production -- logprobs --> temperature -- dictionary --> multinomial_sampler
multinomial_sampler -- token --> content_filter
run_supervision... --> content_filter -- "API filtering and containerization" --> output
Language_inference -- logits_dictionary --> softmax_alt -- bias_ignored --> logprobs_return -- "tokens_to_bytes" --> output

I don’t depict that the token is added to the context, and the generation repeats until interruption.
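The sampling leg of that chart can be sketched as a toy in Python — an assumption-laden model, not the endpoint’s actual code. Temperature is applied here as p**(1/T) on the softmax probabilities, which is equivalent to dividing the logits by T before the softmax:

```python
import math
import random

def sample_next_token(logits, top_p=1.0, temperature=1.0, seed=None):
    """Toy version of the depicted order: softmax -> top_p truncation
    -> temperature -> multinomial sampling. Returns (token, logprobs)."""
    # Softmax over the full dictionary (max-subtracted for stability).
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # The reported logprobs come from this distribution, before sampling.
    logprobs = {t: math.log(p) for t, p in probs.items()}
    # Inclusive top_p truncation of the ranked probabilities.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Temperature rescaling on the surviving mass, then a multinomial draw.
    toks = [t for t, _ in kept]
    weights = [p ** (1.0 / temperature) for _, p in kept]
    rng = random.Random(seed)
    token = rng.choices(toks, weights=weights, k=1)[0]
    return token, logprobs

token, logprobs = sample_next_token({"0": 2.0, "1": 0.7}, top_p=0.0001, seed=0)
print(token)  # with a tiny top_p, only the argmax survives -> "0"
```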

Each token is randomly selected: it is sampled.

Without alterations by top_p or temperature (i.e., using the default value of 1 for each), the AI’s certainty translates directly into the chance that a token will be randomly picked as the output. Like a token lottery.

“Hello” might be a 75%-certain response. If the input is English, “こんにちは” might be 7.2e-8 certain, showing up in maybe one of every ~14 million generations, but still under consideration.

In your case, you have a “0” token at 79%. Run a million trials without altering the sampling parameters, and about 79% of them will show “0”.

Without restraining the output, your AI decision-making on the input you supplied is more akin to a biased coin flip.
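That biased coin flip is easy to check empirically; a quick simulation using the 79% figure from this thread:

```python
import random

# Sampling a token that has probability 0.79 over many trials
# yields it in roughly 79% of generations.
rng = random.Random(0)
trials = 1_000_000
hits = sum(rng.random() < 0.79 for _ in range(trials))
print(hits / trials)  # close to 0.79
```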


Thank you for that comprehensive explanation! So, to conclude: if I set the number of choices to a million (n=1,000,000), roughly 79% of the choices will output the token 0, but the reported logprob for the token 0 in each of those choices will still correspond to about 79%.