Confidence score for prompt response

Hi all,

I am trying the prompt response, where I used the opeai.completion.creat() to get response, I set the temperature = 0 to get the most probable answer, but I am curious whether I can get the confidence score/probability of the response. Is there any way to get this number?

Thank you.


Hi @wmarch015 ,
could you please share with me if you have any progress on this topic.

The model doesn’t have any concept of “the response.”
The model predicts one token at a time. It may have some “confidence” (or “inverse perplexity”) of each token as it’s being generated, but it has no concept or control of the overall semantics of all generated tokens.
Inside the implementation, the code that runs the model calls the model to generate one token, and then iterates, until it determines the model is done (either by the model outputting the “done” token, or by the generated tokens matching a stop sequence.)

If you could get the “confidence” of each token, perhaps you could multiply that value together for all the tokens you generated, but I doubt that would be a very meaningful answer – mainly, because the model isn’t built to have any concept of “the totality of the answer.” As soon as one token has been generated, that token goes back into the “history/input” side of the model, and is treated as “given context” for the next token.

Hi @jwatte,
Great answer, thanks.

Do you know how is this related to the logprobs parameter available in the deprecated completions Api?

I have not used that one. From the documentation it sounds like it returns the probability of “the token” which might be “the last token it returned in the inference” but it seems a little ambiguous what it is it returns.

If there was something that returned the probability of each token that was actually chosen, that’d allow you to calculate the overall confidence, but it doesn’t sound like that’s what it is.

Separately: the “probability” here is very likely the output of a softmax operator (I don’t know if this is documented, but it smells like one,) which is nicely numerically behaved and tends to drive the model towards “making a choice,” but it’s not an exact percentage value of what it “should” be; it’s just an allocation of the choices that it it actually made. The model may be very confident, in the wrong answer :slight_smile: And why “predicted token weight” should map to “exponent of normalized weight” to generate “probability” is … hand-wavy. Seems to work OK in practice, though!

Alright, I tried the actual API, and it does return the list of logprobs for each token in both the input and output.

Thus, you could use that information to calculate the “probability” of this particular sequence being chosen, and using that as some form of “confidence.” But, in general, I wouldn’t put too much stock in that value, because it will be poorly behaved – for longer outputs, the probabilities will multiply out to lower overall probability, and if the model picks one low-probability token (which could be something unimportant like “and” instead of “the”) the overall probability will multiply out to much lower.

Anyway – if what you want is “confidence in the overall answer” then you can’t really construct that from “random probability of each word fragment token.”
This actually gives a pretty neat insight into why I think these models aren’t really “thinking,” too – they just predict, one token after the next, with no “overall” model of what they’re doing.


The logprobs is the value before any scaling is applied. It’s for this reason you will see something like:

Screenshot from 2023-08-29 12-13-23

(if that’s what you were referring to) (correct me if I’m wrong) (it’s been so long since I’ve used davinci via api but I think it’s actually returned as a value and the playground calculates the probabilities)

1 Like

Yes, that makes sense; I agree!

1 Like

That said, when you have a large body of responses, you can reinject them into the LLM and ask him to evaluate them.
Additionnaly, you can ask the LLM to review his own response, or probe internet for fact checking what he said himself and give a confidence score based on that.
Its all proxies but depending on your context it can help.

The API completion endpoint already includes a mechanism that does similar on the entire output: the “best_of” parameter. Acting on the score instead of reporting it.

Like returning the logit probabilities themselves, it is not available for chat models such as gpt-4.


Generates best_of completions server-side and returns the “best” (the one with the highest log probability per token). Results cannot be streamed.

When used with n, best_of controls the number of candidate completions and n specifies how many to return – best_of must be greater than n.

Note: Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for max_tokens and stop.

This also doesn’t ensure entailment quality, just an overall “minimum deviation from the plan” generation that uses highest probability tokens. It is more for when you’ve allowed such creativity in token choices by not constraining temperature.

The “best_of” parameter doesn’t actually return anything resembling an overall probability or confidence in the generated output, it controls the generation based on inferred probabilities, which is different.

Anyway, it sounds like we all agree:

  1. The old completion API does return probabilities, which can be multiplied together after conversion (or added, and then exponentiated) to calculate “the probability of the generated text.”
  2. This isn’t actually a very good measure of “confidence” anyway.
  3. The new chat completion API doesn’t even give you the data.
  4. LLMs don’t really have a concept of “the entire thing they generate” because that “entire thing” is a product of iteration that happens outside of the model itself; the model only considers each token at a time, so what the original request wants, doesn’t even exist.