Our application involves a Retrieval-Augmented Generation (RAG)-based system where users ask questions and are provided with predefined selection options. The AI is responsible for selecting the most appropriate option based on documents (PDFs) while adhering to an industry-specific “thought process” defined in the prompt. We are using gpt-4o-mini for our application.
To enhance transparency, we implemented the RAG triad with TruLens to compute confidence scores for the AI’s selected answers. However, we have encountered a significant challenge:
- Since the answer selection is based not purely on semantic meaning but on logical reasoning defined in the thought process prompt, we are observing misalignment between confidence scores and answer correctness.
- In many cases, when the AI correctly selects an option based on the thought process, the confidence score is unexpectedly low. Conversely, when an incorrect response is selected based on semantic similarity, the confidence score is high.
This suggests that during context retrieval and answer evaluation, the model is prioritizing semantic alignment over the reasoning framework explicitly provided in the prompt.
Is there any framework available that can help us measure context relevance and answer relevance with respect to the thought process and instructions (i.e. the reasoning), and generate a true AI confidence score for each response?
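For concreteness, the kind of measurement we are after looks roughly like this minimal sketch: a separate judge call that grades how well the selected option follows the required thought process rather than how semantically similar it is to the question. The judge prompt and the 0–10 scale here are illustrative assumptions, not part of TruLens or any existing framework:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt: the rubric and 0-10 scale are assumptions,
# not an off-the-shelf evaluation framework.
JUDGE_PROMPT = """You are grading an answer selection.
Thought process the assistant was required to follow:
{thought_process}

Question: {question}
Retrieved context: {context}
Selected option: {answer}

Score from 0 to 10 how well the selected option follows the required thought
process given the context (not how semantically similar it is to the question).
Reply as JSON: {{"score": <int>, "reason": "<one sentence>"}}"""


def instruction_adherence_score(thought_process: str, question: str,
                                context: str, answer: str) -> dict:
    """Grade one response for adherence to the prompted reasoning framework."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            thought_process=thought_process, question=question,
            context=context, answer=answer)}],
        response_format={"type": "json_object"},  # force a parseable JSON reply
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```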
I have a similar setup with an analysis process for legal documents; here is how I approached the problem:
- Instead of running the user request as a single query against the vector database, another LLM (plus application configuration) suggests related queries, which are run in parallel to grab more context.
- Once all elements are retrieved from the vector database, a filter evaluates their relation to the original user query (pre-selection of results by their usefulness).
- The answering LLM (also 4o-mini) has a predefined prompt for each analysis step, with a list of accepted/possible answers to choose from based on the inquiry and the retrieved samples.
- The answer from step 3 is checked by classic code to see if it matches an accepted answer; if the LLM failed to provide one, there is logic to either fall back to a default answer or retry.
- The final answer is produced by another LLM, which is given the retrieved and pre-selected samples plus the answer from step 3, and whose task is to reason about and explain why step 3 produced that result (this may not be applicable to your situation).
- The last step before returning the answer to the user combines the answers from steps 3 and 5 and formats them in a more digestible form. In your situation this may be simple code, but it depends on the application.
So in your case the confidence would be evaluated on the response from step 3, while the answer provided to the user would be built from steps 5 and 6.
Another note: this process is applied to a single step within the thinking flow. So if your application has a more elaborate decision-making process, I would suggest breaking it down into single steps, which you could parallelize if possible (it was possible for me). A rough skeleton of the flow is sketched below.
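Here is a minimal sketch of steps 1, 3 and 4 in Python, assuming the current openai SDK. The `search` callable, the option list and the prompt texts are hypothetical placeholders for your own retrieval layer and configuration; the step‑2 usefulness filter and steps 5–6 (explanation and final formatting) would just be further gpt-4o-mini calls and are omitted for brevity:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

from openai import OpenAI

client = OpenAI()


def suggest_related_queries(user_query: str) -> list[str]:
    """Step 1: a helper model proposes related queries to run alongside the original one."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Suggest 3 short search queries related to the user request, one per line."},
            {"role": "user", "content": user_query},
        ],
        temperature=0,
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]


def retrieve_in_parallel(queries: list[str], search: Callable[[str], list[str]]) -> list[str]:
    """Step 1 (retrieval part): run all queries against the vector store in parallel, then de-duplicate."""
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(search, queries))  # `search` is your own vector-store call
    return list(dict.fromkeys(chunk for batch in batches for chunk in batch))


def answer_from_options(user_query: str, chunks: list[str], options: list[str]) -> str:
    """Step 3: the answering model must choose one of the accepted options."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Using only the provided context, answer with exactly one of: "
                        + ", ".join(options)},
            {"role": "user", "content": "Context:\n" + "\n---\n".join(chunks)
                                        + f"\n\nQuestion: {user_query}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def validate(answer: str, options: list[str], default: str) -> str:
    """Step 4: plain code checks the model output; fall back to a default (or retry) otherwise."""
    return answer if answer in options else default
```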
Hope that helps.
Thank you for your approach, sergeliatko. Can you please elaborate on how you generate the confidence score from steps 3 and 5? Which framework would be suitable, and what type of linear equation could I create for generating a confidence score?
Using the logprobs in the chat completions object: https://platform.openai.com/docs/api-reference/chat/object (step 3 only).
I have a setting in my app to consider a completion “valid” if the logprob of the first token is above a certain level.
This brings an (optional) trick: make sure each acceptable option starts with a unique token, which simplifies the confidence check.
In my step 5 the confidence is irrelevant, as the app is built around step 3, which is crucial, while step 5 is more “decorative” and allows wording variability.
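A minimal sketch of that step‑3 check with the openai Python SDK, assuming the options each start with their own single token (“A”, “B”, “C” below are illustrative stand-ins) and a hypothetical logprob threshold of -0.1 (roughly 90% token probability):

```python
import math

from openai import OpenAI

client = OpenAI()

# Illustrative assumption: each accepted option starts with its own unique
# single token, so the first generated token identifies the choice.
ACCEPTED_FIRST_TOKENS = {"A", "B", "C"}
LOGPROB_THRESHOLD = -0.1  # exp(-0.1) ~ 0.90 probability; tune for your application


def classify_with_confidence(system_prompt: str, question: str, context: str):
    """Return (choice, confidence, is_valid) based on the first token's logprob."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{question}"},
        ],
        logprobs=True,  # ask the API to return token log probabilities
        max_tokens=1,   # the choice is encoded in the first token
        temperature=0,
    )
    first = response.choices[0].logprobs.content[0]
    confidence = math.exp(first.logprob)  # convert logprob to a 0..1 probability
    choice = first.token.strip()
    is_valid = choice in ACCEPTED_FIRST_TOKENS and first.logprob >= LOGPROB_THRESHOLD
    return choice, confidence, is_valid


# Usage: retry or fall back to a default when the check fails.
answer, confidence, valid = classify_with_confidence(
    system_prompt="Answer with exactly one of: A, B, C.",
    question="Which clause governs termination?",
    context="...retrieved and pre-selected passages...",
)
if not valid:
    answer = "B"  # hypothetical default, or re-run the query
```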