Evaluating the confidence levels of outputs generated by Large Language Models (GPT-4o)

RaviKasaudhan · February 22, 2025, 6:22pm

Greetings All,

I’m seeking guidance on evaluating the confidence levels of outputs generated by Large Language Models (LLMs).

Use Case: I provide a document to the model and request the extraction of approximately 10 key-value pairs, specifying the keys to be extracted. I aim to assess the model’s confidence in its outputs.

Note: The document templates vary and are not consistent. I am considering using GPT-4o to process the document images directly.

Any insights or recommendations on how to effectively measure the model’s confidence in this context would be greatly appreciated.

During my research, I came across a couple of approaches:

I can call another LLM, pass it the extracted key-value pairs along with the document, and ask it to give a confidence score for each extracted pair.
I learned about the logprobs parameter, which returns the log probability of each output token generated by the LLM.
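
As a rough illustration of the second approach, here is a minimal sketch of how the token log probabilities for one extracted value might be turned into a single score. It assumes you have already collected the logprob of each token that makes up the value (for example by enabling logprobs on the request); the function names are hypothetical and not part of any library.

```python
import math
from typing import Sequence


def joint_probability(token_logprobs: Sequence[float]) -> float:
    """Probability of the whole value: exp of the summed token logprobs."""
    return math.exp(sum(token_logprobs))


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of the token probabilities.

    Less sensitive to value length than the joint probability,
    since long values are not penalised just for having more tokens.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical logprobs for the two tokens of one extracted value.
example = [math.log(0.9), math.log(0.8)]
print(joint_probability(example))
print(mean_token_probability(example))
```

Either number is only a proxy: it reflects how likely the model found those tokens, not whether the value is actually correct for the document.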

platypus · February 22, 2025, 6:34pm

Hi @RaviKasaudhan and welcome to the community!

Both of your approaches are reasonable. Note, however, that logprobs in modern OpenAI models seem to be of limited use: they can vary widely between runs, even when all other conditions are held constant.

Your first approach, often referred to as “LLM-as-a-judge”, is quite popular. You can go as far as having three judges and taking an average or winner-takes-all, though this gets a bit costly, so it depends on how much that confidence is worth to you.
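
A minimal sketch of the two ways of combining judges mentioned above, assuming each judge has already returned either a numeric score or a categorical verdict for one extracted key-value pair. The helper names are hypothetical, and "winner-takes-all" is interpreted here as a simple majority vote.

```python
from collections import Counter
from statistics import mean
from typing import Sequence, Tuple


def average_score(scores: Sequence[float]) -> float:
    """Average the numeric confidence scores returned by the judges."""
    return mean(scores)


def majority_verdict(verdicts: Sequence[str]) -> Tuple[str, float]:
    """Return the most common verdict and the share of judges that chose it."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    verdict, votes = Counter(verdicts).most_common(1)[0]
    return verdict, votes / len(verdicts)


# Hypothetical outputs from three judges for a single extracted field.
print(average_score([0.9, 0.7, 0.8]))
print(majority_verdict(["correct", "correct", "incorrect"]))
```

With an odd number of judges and a binary verdict there are no ties; with more verdict labels a tie-breaking rule would be needed.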

mat.eo · February 22, 2025, 6:40pm

First, great response.

Could there be any sort of aggregation of logprobs to maybe have an idea of confidence? As in, run the same query N times and somehow combine the logprobs from all responses?
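
One possible shape for such an aggregation, as a sketch only: group the N responses by the value they extracted, then report how often the most frequent value appeared and its average probability across runs. The input format (a value plus its token logprobs per run) and the function name are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate_runs(runs: Sequence[Tuple[str, Sequence[float]]]) -> Dict[str, object]:
    """Combine N runs of the same query for one key.

    Each run is (extracted_value, token_logprobs_for_that_value).
    """
    if not runs:
        raise ValueError("need at least one run")

    # Joint probability of each run's value, grouped by the value itself.
    grouped: Dict[str, List[float]] = defaultdict(list)
    for value, token_logprobs in runs:
        grouped[value].append(math.exp(sum(token_logprobs)))

    # The value that appeared in the most runs wins.
    best = max(grouped, key=lambda v: len(grouped[v]))
    probs = grouped[best]
    return {
        "value": best,
        "agreement": len(probs) / len(runs),       # share of runs that agree
        "mean_probability": sum(probs) / len(probs),  # average over agreeing runs
    }


# Hypothetical results of running the same extraction three times.
runs = [
    ("INV-1", [math.log(0.9)]),
    ("INV-1", [math.log(0.7)]),
    ("INV-7", [math.log(0.4)]),
]
print(aggregate_runs(runs))
```

The agreement rate does not depend on logprobs at all, so it remains usable even if the logprobs themselves turn out to be noisy.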

platypus · February 22, 2025, 6:57pm

Hi @mat.eo! That also sounds reasonable to me. Personally, I have moved away from logprobs with proprietary models since I have no visibility into what happens upstream. Open models are another matter entirely.

RaviKasaudhan · February 23, 2025, 4:52pm

Hi @platypus,

Thank you for your response.

Have you explored any alternative methods to assess confidence levels? I would be interested to learn about your experiences.

Also, I would love to hear any ideas you have for my use case (explained in the top post).

sales66 · June 8, 2025, 8:30pm

Following up on this important topic, I’ve posted a proposal with five core principles for grounding language models in verifiable behavior. The post is here: Toward Truth-Aligned AI: A Proposal for Grounded Language Models. I would love your thoughts.
