Logprobs and message.content are inconsistent

gyanveda · April 10, 2024, 6:23pm

I am using gpt-4-turbo-2024-04-09 via the API with a prompt that asks the model to respond with either 0 or 1. My API calls include these arguments:

'max_tokens': 1,
'n': 3,
'logprobs': True,
'top_logprobs': 2

99.5% of the time, message.content (the model output) is identical to the top token by logprob (e.g., model output is 1 and top token by logprob is 1).

However, 0.5% of the time, this isn’t the case (e.g., model output is 1 and top token by logprob is 0).

Is this discrepancy/difference ever expected or is this likely a bug?
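One way to count how often this happens is to compare each choice's message.content with the highest-ranked entry in its top_logprobs. Below is a minimal sketch, assuming the response has already been converted to a plain dict shaped like the Chat Completions JSON (for example via `response.model_dump()` in the Python SDK). The helper name `find_mismatches` and the `sample` payload are made up for illustration and are not real API output.

```python
import math

def find_mismatches(response: dict) -> list[dict]:
    """Return one record per choice whose sampled text differs from its top-logprob token."""
    mismatches = []
    for choice in response["choices"]:
        sampled = choice["message"]["content"]
        # With max_tokens=1 there is exactly one generated position to inspect.
        first_position = choice["logprobs"]["content"][0]
        best = max(first_position["top_logprobs"], key=lambda t: t["logprob"])
        if best["token"] != sampled:
            mismatches.append({
                "index": choice["index"],
                "sampled": sampled,
                "top_token": best["token"],
                # Logprobs are natural logs, so exp() recovers the probability.
                "top_probability": math.exp(best["logprob"]),
            })
    return mismatches

# Hypothetical payload: both choices share one distribution but sampled different tokens.
top = [{"token": "0", "logprob": math.log(0.8)}, {"token": "1", "logprob": math.log(0.2)}]
sample = {
    "choices": [
        {"index": 0, "message": {"content": "0"},
         "logprobs": {"content": [{"token": "0", "logprob": math.log(0.8), "top_logprobs": top}]}},
        {"index": 1, "message": {"content": "1"},
         "logprobs": {"content": [{"token": "1", "logprob": math.log(0.2), "top_logprobs": top}]}},
    ]
}

print(find_mismatches(sample))
```

Running this over a batch of saved responses gives the mismatch rate directly, along with how confident the model was in the token it did not output.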

_j · April 10, 2024, 7:11pm

The AI uses random sampling that is based on probabilities. This makes language more diverse, more human…more error-prone.

80% certainty means 20% chance something else is written.

If you want the top logprob as language output, you have to reduce the probability mass from which sampling can take place. Another API parameter:

'top_p': 0.0001

gyanveda · April 11, 2024, 1:48am

Thanks for the explanation! The API docs explain how to set top_p with this example: “0.1 means only the tokens comprising the top 10% probability mass are considered.”

Do you know what this means with regard to guaranteeing that I always get at least one result? If I set top_p too small (e.g., 0.0001), do I run the risk of not getting back any top tokens?

_j · April 11, 2024, 1:59am

That is a good question. The top_p is inclusive. It would only “return nothing” if you could send a text value of a float that was smaller than the smallest magnitude that can be represented in whatever math is being done behind the scenes by the endpoint. Setting it to 0 actually has a short circuit to a value like 0.01 instead of the smallest number you can send yourself, like 1e-19.

Inclusive means that if the top probabilities were {“hello”: 25%, “hi”: 20%}, a top_p of 0.26 would include both tokens for sampling at their respective weights, because the token that crosses the threshold is kept.
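The inclusive behavior can be sketched as follows. This is a textbook-style nucleus (top_p) filter, not the endpoint's actual implementation, which is not public. The function name `nucleus_filter` and the extra tokens beyond “hello” and “hi” are assumptions added for illustration.

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top-ranked tokens whose cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p      # inclusive: the token that crosses the threshold stays in
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving tokens' probabilities sum to 1 before sampling.
    return {token: p / total for token, p in kept.items()}

# Hypothetical distribution; the remaining probability mass is omitted for brevity.
probs = {"hello": 0.25, "hi": 0.20, "hey": 0.15, "greetings": 0.10}

print(nucleus_filter(probs, 0.26))    # keeps "hello" and "hi"
print(nucleus_filter(probs, 0.0001))  # keeps only "hello"
```

Because the top-ranked token is always added before the threshold is checked, even a tiny top_p leaves one candidate to sample from.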

Since the dictionary is limited to 100k tokens, there is a minimum normalized softmax probability value that a top token can have.

Also, be mindful that the models themselves are no longer deterministic. The logprob values can vary between runs, to the point where logits with similar scores can switch rank.

gyanveda · April 11, 2024, 2:42am

Thank you for that explanation!

One last question on this topic. I set the completion choices to 3 (i.e., n=3) and I’m finding the message.content (the actual output from the API) varies across these three choices (e.g., 0 for the first and 1 for the second).

Which makes sense.

But when I look at their top_logprobs, they’re identical (e.g., 79% for 0 and 21% for 1, for both the first and second choices).

Does this mean that what varies between choices is which token is sampled as the output, rather than the computed token probabilities?

_j · April 11, 2024, 4:29am

Let’s talk about the output of a language model in overall terms, in the order of operations performed to generate a single n+1 token based on the current input context.

Here’s a convoluted abstraction I just typed up, from lots of probing and some deterministic trials that can no longer be meaningfully done on current models. Some of these interjections into the process need more flowchart branches than I depict.

    flowchart

    context_&_AI_pretraining --> embeddings --> Language_inference
    context_&_AI_pretraining --> hidden_state --> Language_inference
    context_&_AI_pretraining --> json_mode --> Language_inference
    context_&_AI_pretraining --> run_supervision... --> Language_inference
    Language_inference -- logits_dictionary --> logit_bias -- logits_dictionary --> softmax_production

    softmax_production --> top_p -- truncated_mass --> softmax_production
    softmax_production -- logprobs --> temperature -- dictionary --> multinomial_sampler
    multinomial_sampler -- token --> content_filter
    run_supervision... --> content_filter -- "API filtering and containerization" --> output
    Language_inference -- logits_dictionary --> softmax_alt -- bias_ignored --> logprobs_return -- "tokenstobytes" --> output

I don’t depict that the token is added to the context, and the generation repeats until interruption.

Each token is randomly selected: it is sampled.

Without alterations by top_p or temperature (that is, with the default value of 1 for each), the AI’s certainty is directly translated into the chance that the token will be randomly picked as the output. Like a token lottery.
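A rough sketch of that lottery, assuming the standard formulation in which temperature divides the logits before softmax (rescaling logprobs and renormalizing gives the same result). The logits are invented so that “0” lands at 79%. The function names are illustrative only and say nothing about how the endpoint is actually built.

```python
import math
import random

def softmax(logits: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Convert raw logits to probabilities; temperature rescales the logits first."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token with chance equal to its probability (the token lottery)."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical logits chosen so that "0" gets 79% at temperature 1.
logits = {"0": math.log(0.79), "1": math.log(0.21)}
default_probs = softmax(logits)                    # temperature 1: certainty unchanged
sharper_probs = softmax(logits, temperature=0.5)   # lower temperature favors the top token

rng = random.Random(0)  # seeded only so the sketch is repeatable
print(default_probs)
print(sharper_probs)
print(sample_token(default_probs, rng))
```

At temperature 1 the sampling weights equal the model's probabilities, and lowering the temperature shifts more of the weight onto the top token.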

“Hello” might be a 75% certain response. If the input is English, “こんにちは” might be 7.2e-8 certain, likely to show up in only about one in fourteen million generations, but still under consideration.

In your case, you have a “0” token at 79%. Run a million trials without altering the sampling parameters, and about 79% of the trials will show “0”.

Without restraining the output, your AI decision-making on the input you supplied is more akin to a biased coin flip.
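The biased coin flip can be simulated locally. This sketch assumes the 79%/21% split is identical for every choice, which holds only approximately in practice since the models are not fully deterministic. Nothing here calls the API; the numbers are taken from the example in this thread.

```python
import math
import random
from collections import Counter

probs = {"0": 0.79, "1": 0.21}  # fixed distribution computed once for the prompt
rng = random.Random(42)         # seeded so the sketch is repeatable

n = 100_000
tokens = list(probs)
weights = [probs[t] for t in tokens]
choices = rng.choices(tokens, weights=weights, k=n)  # stands in for message.content
counts = Counter(choices)

share_of_zero = counts["0"] / n
# The reported logprob depends only on the distribution, not on which token was drawn.
reported_logprob = math.log(probs["0"])

print(f"sampled '0' in {share_of_zero:.1%} of choices")
print(f"reported probability of '0': {math.exp(reported_logprob):.0%}")
```

The sampled share hovers near 79% while the reported probability stays fixed, which is why choices with different content can carry identical top_logprobs.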

gyanveda · April 11, 2024, 2:11pm

Thank you for that comprehensive explanation! So, to conclude, if I set my choices to a million (n=1,000,000), 79% of the choices will output the token 0, but the logprob of each of those choices for the token 0 will always correspond to 79%.
