How do logprobs work for the chat completions API (GPT-4.1)?

I’ve been trying to understand how logprobs work in ChatGPT (GPT-4.1) and I’m running into some confusing results.

I asked the model to score a text from 1 to 10. The JSON output was:

{"score": 8}

But when I inspected the logprobs and converted them to probabilities via exp(logprob) (see the sketch after the list), the probabilities were:

'7':        1.522997974471263e-08  
' eight':   9.056076989672867e-11  
'8':        9.545034922840628e-12  
' Eight':   7.433680672352188e-12  
' eighth':  1.8795288165390832e-12  
' acht':    1.2015425731771786e-13  
' вось':    8.25807328555592e-14  
' ocho':    1.626111044617819e-14  
'-eight':   4.658886145103398e-15  
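
(For reference, a minimal sketch of the conversion I used; top_entries here is a stand-in name for one position's top_logprobs list from the API response:)

import math

# turn each top_logprobs entry's logprob into a probability
for entry in top_entries:
    print(entry["token"], math.exp(entry["logprob"]))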

A couple of things puzzle me:

  • 8 has a much lower probability than 7, yet the model almost always outputs 8 when I run this multiple times.

  • The distribution contains many variants of “8” in different forms and languages, which seems to split probability mass away from the plain '8'. It's also odd that '8' has a lower reported probability than '7' yet is consistently sampled over it, which might imply the tokens are not independent in sampling.

  • The true probabilities are tiny (10⁻⁸ and smaller), yet '8' is consistently sampled. Is this expected?

So my questions are:

  1. Are these probabilities actually meaningful in this context?

  2. How does ChatGPT/GPT-4.1 sample from this distribution - is it pure softmax+temperature, or is there something else going on?

  3. Why would the token with the highest actual sample frequency (“8”) not align with the highest reported probability (“7”)?

Let’s test whether the values are actually broken on gpt-4.1, as such results would suggest.

A boolean answer is solicited, and the logprob is extracted from the position of the JSON output string value that the model is instructed to produce:


Messages

SYSTEM


You are a binary classifier, answering every question with only Yes or No.
You are an expert at finding the best truthful boolean answer to any input question.
Regardless of the type of input or how inapplicable, you still must determine the best choice.

# Responses

You produce a JSON with key answer; the value of answer must be chosen from only enums:
['yes', 'no']

# Permitted JSON responses

## select one only from:

{"answer":"yes"}
{"answer":"no"}

USER

yes or no: Is a cashew apple actually a berry?
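
(A minimal sketch of the request, using the openai Python package; system_prompt stands in for the system message above, and the top_logprobs count and max_tokens are assumptions, not the exact settings used:)

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # then repeated with model="gpt-4.1"
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "yes or no: Is a cashew apple actually a berry?"},
    ],
    logprobs=True,   # return the logprob of each sampled token
    top_logprobs=4,  # plus the most likely alternatives at each position
    max_tokens=10,
)
response_dict = response.model_dump()  # plain dict, used in the snippets below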

running gpt-4o

Response

RESPONSE content: {"answer":"no"}
RESPONSE token number(s): [1750]

Logprobs:
Token: "no"
Probability: 67.917288732095443%

Top Logprobs:
Token: "no"
Probability: 67.917288732095443%

Token: "yes"
Probability: 32.081855549896488%

Token: "Yes"
Probability: 0.000324992083312%

Token: " no"
Probability: 0.000119557905994%

running gpt-4.1

Response

RESPONSE content: {"answer":"no"}
RESPONSE token number(s): [1750]

Logprobs:
Token: "no"
Probability: 90.465043176398169%

Top Logprobs:
Token: "no"
Probability: 90.465043176398169%

Token: "yes"
Probability: 9.534945969075354%

Token: "No"
Probability: 0.000000099806134%

Token: "Yes"
Probability: 0.000000032402308%

Answer:

Logprobs have the expected distribution.


The task is so highly instructed that the logprob of each token leading up to the output position is 0.0, i.e. exp(0.0) = 1.0, a probability of 100%:

logprobs[0]
{'token': '{"', 'bytes': [123, 34], 'logprob': 0.0, 'top_logprobs': [{'token': '{"', 'bytes': [123, 34], 'logprob': 0.0, 'prob': 1.0}, {'token': "{'", 'bytes': [123, 39], 'logprob': -19.875, 'prob': 2.335593038800113e-09}, {'token': '```', 'bytes': [96, 96, 96], 'logprob': -21.5, 'prob': 4.59905537865397e-10}, {'token': '{\\"', 'bytes': [123, 92, 34], 'logprob': -23.375, 'prob': 7.052879851114916e-11}], 'prob': 1.0}
logprobs[1]
{'token': 'answer', 'bytes': [97, 110, 115, 119, 101, 114], 'logprob': 0.0, 'top_logprobs': [{'token': 'answer', 'bytes': [97, 110, 115, 119, 101, 114], 'logprob': 0.0, 'prob': 1.0}, {'token': 'ANSWER', 'bytes': [65, 78, 83, 87, 69, 82], 'logprob': -19.734375, 'prob': 2.688251109328749e-09}, {'token': ' answer', 'bytes': [32, 97, 110, 115, 119, 101, 114], 'logprob': -22.37109375, 'prob': 1.9246751109065142e-10}, {'token': '\tanswer', 'bytes': [9, 97, 110, 115, 119, 101, 114], 'logprob': -22.841796875, 'prob': 1.2020807997466449e-10}], 'prob': 1.0}
logprobs[2]
{'token': '":"', 'bytes': [34, 58, 34], 'logprob': 0.0, 'top_logprobs': [{'token': '":"', 'bytes': [34, 58, 34], 'logprob': 0.0, 'prob': 1.0}, {'token': '":', 'bytes': [34, 58], 'logprob': -22.375, 'prob': 1.917171513759029e-10}, {'token': "':'", 'bytes': [39, 58, 39], 'logprob': -26.0, 'prob': 5.109089028065546e-12}, {'token': '\\":\\"', 'bytes': [92, 34, 58, 92, 34], 'logprob': -30.4375, 'prob': 6.04173548070253e-14}], 'prob': 1.0}


Tip: be sure you are extracting from the correct token position. Number tokens are not joinable and do not carry a leading space of their own, so in spaced JSON the space must come from a token of its own. You might be missing that the “99.9% token” is the space that comes before the value.
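
A quick way to verify the position is to enumerate every output token with its probability (a sketch; logprobs here is the response's "content" list, as extracted in the snippet below):

import math

# print index, token text, and probability for every output position,
# to see where the answer token actually sits
for i, entry in enumerate(logprobs):
    print(i, repr(entry["token"]), math.exp(entry["logprob"]))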

A “prob” field is added for human interpretation, calculated without an external function (avoiding from math import exp, which might be damaged elsewhere).

# create a logprobs object that also has probabilities
e = 2.718281828459  # Euler's number, hard-coded to avoid importing math
if response_dict["choices"][0]["logprobs"]:
    logprobs = response_dict["choices"][0]["logprobs"]["content"]
    for entry in logprobs:
        entry["prob"] = (
            e ** entry["logprob"]
        )  # "logprob" to probability
        for top in entry.get("top_logprobs", []):
            top["prob"] = (
                e ** top["logprob"]
            )  # "logprob" in "top_logprobs"
    lp = logprobs[3]  # the specific logprob entry for the actual answer token

Just to note: in my own example here, I rewrote the system message (where the enums being injected are automated) so that the JSON has whitespace and the key names are enclosed in backticks, and that output is sent to the API… This advances the answer position one token forward, with the result of making the AI more sure about ambiguous fruits.
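
For illustration (my guess at the tokenization, not an exact trace): compact {"answer":"no"} splits as '{"', 'answer', '":"', 'no', '"}', putting the answer at index 3, while the spaced {"answer": "no"} splits as '{"', 'answer', '":', ' "', 'no', '"}', moving the answer to index 4.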

>>> for logprob in logprobs:
...     print(logprob['prob'])
 
1.0
1.0
0.9999920581810099
1.0
0.9975274032511579
1.0