Evaluating the confidence levels of outputs generated by Large Language Models (GPT-4o)

Greetings All,

I’m seeking guidance on evaluating the confidence levels of outputs generated by Large Language Models (LLMs).

Use Case: I provide a document to the model and request the extraction of approximately 10 key-value pairs, specifying the keys to be extracted. I aim to assess the model’s confidence in its outputs.

Note: The document templates vary and are not consistent. I am considering using GPT-4o to process the document images directly.

Any insights or recommendations on how to effectively measure the model’s confidence in this context would be greatly appreciated.

During my research, I came across two approaches:

  1. Call a second LLM, pass it the extracted key-value pairs along with the document, and ask it for a confidence score for each extracted pair.
  2. Use the logprobs parameter, which returns the log probability of each output token generated by the LLM.
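
To make the second approach concrete, here is a minimal sketch of turning per-token log probabilities into a confidence figure for one extracted value. It assumes you have already collected the logprobs of the tokens that make up that value (the mapping from response tokens to a given key is not shown). The function name `value_confidence` and the sample numbers are hypothetical.

```python
import math

def value_confidence(token_logprobs):
    """Summarise the logprobs of the tokens that form one extracted value."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    return {
        # Joint probability of the whole token sequence.
        "joint": math.exp(total),
        # Geometric mean per token, so long values are not penalised for length.
        "per_token": math.exp(total / len(token_logprobs)),
        # Weakest single token, handy for spotting one uncertain digit.
        "min_token": math.exp(min(token_logprobs)),
    }

# Hypothetical logprobs for a value that was generated as two tokens.
example = value_confidence([math.log(0.9), math.log(0.8)])
print(example)
```

Which of the three figures is most useful probably depends on the field: the minimum tends to be stricter for IDs and amounts, while the per-token mean is gentler on long free-text values.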

Hi @RaviKasaudhan and welcome to the community!

Both of your approaches are reasonable. Note, however, that logprobs from modern OpenAI models seem to be of limited use: they can vary widely from run to run, even with all other conditions held constant.

Your first approach, usually referred to as “LLM-as-a-judge”, is quite popular. You can go as far as using three judges and taking an average or a winner-takes-all vote, but this gets a bit costly; it depends on how much the confidence estimate is worth to you.
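
As a rough sketch of the two ways to combine judges, assuming each judge returns either a numeric score in [0, 1] or a categorical verdict for a given key-value pair (the judge calls themselves are not shown, and all names and numbers here are hypothetical):

```python
from collections import Counter
from statistics import mean

def average_score(scores):
    """Average the numeric confidence scores returned by the judges."""
    return mean(scores)

def majority_verdict(verdicts):
    """Winner-takes-all: return (winning verdict, share of judges agreeing).

    On a tie, Counter.most_common keeps the verdict seen first.
    """
    winner, votes = Counter(verdicts).most_common(1)[0]
    return winner, votes / len(verdicts)

# Hypothetical outputs from three judges for one extracted pair.
judge_scores = [0.9, 0.7, 0.8]
judge_verdicts = ["correct", "correct", "incorrect"]

avg = average_score(judge_scores)
verdict, agreement = majority_verdict(judge_verdicts)
print(avg, verdict, agreement)
```

An odd number of judges with a binary verdict avoids ties, which is one reason three is a common choice.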


First, great response.

Could some aggregation of logprobs give an idea of confidence? For example, run the same query N times and somehow combine the logprobs from all responses?
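
One possible shape for this, purely as a sketch: for each key, collect the value and its summed token logprob from every run, pick the value that comes back most often, and report both how often it came back and its mean probability. The function name `aggregate_runs` and the sample data are hypothetical, and whether the resulting numbers are well calibrated is an open question.

```python
import math
from collections import defaultdict

def aggregate_runs(runs):
    """Combine N runs for one key.

    runs: list of (value, summed_token_logprob) pairs, one per run.
    """
    if not runs:
        raise ValueError("need at least one run")
    probs_by_value = defaultdict(list)
    for value, logprob in runs:
        probs_by_value[value].append(math.exp(logprob))

    def rank(value):
        probs = probs_by_value[value]
        # Most frequent value wins; mean probability breaks ties.
        return (len(probs), sum(probs) / len(probs))

    best = max(probs_by_value, key=rank)
    best_probs = probs_by_value[best]
    return {
        "value": best,
        # Share of runs that returned the winning value.
        "agreement": len(best_probs) / len(runs),
        # Mean sequence probability across the runs that agreed.
        "mean_prob": sum(best_probs) / len(best_probs),
    }

# Hypothetical results for one key over three runs.
runs = [
    ("2024-01-05", math.log(0.9)),
    ("2024-01-05", math.log(0.7)),
    ("2024-01-06", math.log(0.6)),
]
result = aggregate_runs(runs)
print(result)
```

The agreement figure does not depend on logprobs at all, so it would still be usable if the logprobs themselves turn out to be too noisy.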


Hi @mat.eo! That also sounds reasonable to me. Personally, I have moved away from logprobs with proprietary models, since I have no visibility upstream at all. Open models are another matter entirely :wink:


Hi @platypus,

Thank you for your response.

Have you explored any alternative methods to assess confidence levels? I would be interested to learn about your experiences.

Also, I would love to hear any ideas you have for my use case (explained in the top post).
