Metrics for evaluating the answers endpoint?

Hello,

Is there a standard set of KPIs or metrics to evaluate the reply given by the answers endpoint?

We’re going to start testing our deployment internally, and I’d like input from the team on each answer. I came up with the following criteria, each rated from 1 to 5 per answer (a rough scoring sketch follows the list):

  • Does the answer make sense for the question asked? e.g. you ask about bakeries but the answer is about neighborhoods
  • Does the answer actually answer your question? e.g. you ask for the best bakery and the answer names a single bakery
  • Is there information in the answer that should not be there? e.g. you ask for the best bakery and the answer names one bakery but also includes info about neighborhoods
  • Is it a good answer to your question? e.g. you ask for the best restaurant and the answer should not be McDonald’s
  • Is the answer properly justified? e.g. you ask for the best restaurant and the answer gives the restaurant’s name along with a short explanation of why it is the best
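
For reference, here’s a minimal sketch of how I’m thinking of recording and aggregating those ratings. The criterion names and the `Review` structure are just my own placeholders, not anything from the API:

```python
# Minimal sketch for recording 1-5 ratings per answer and averaging them
# per criterion. Criterion names and the Review structure are placeholders.
from dataclasses import dataclass
from statistics import mean

CRITERIA = [
    "relevance",       # does the answer make sense for the question asked?
    "completeness",    # does it actually answer the question?
    "focus",           # is there information that should not be there?
    "quality",         # is it a good answer?
    "justification",   # is the answer properly justified?
]

@dataclass
class Review:
    question: str
    answer: str
    scores: dict[str, int]  # criterion -> rating from 1 to 5

def average_scores(reviews: list[Review]) -> dict[str, float]:
    """Mean rating per criterion across all reviewed answers."""
    return {c: mean(r.scores[c] for r in reviews) for c in CRITERIA}

reviews = [
    Review(
        question="What is the best bakery in Paris?",
        answer="Boulangerie X, because ...",
        scores={"relevance": 5, "completeness": 4, "focus": 5,
                "quality": 4, "justification": 3},
    ),
]
print(average_scores(reviews))
```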

Am I on the right track? Any input is highly appreciated! Thanks!

Looks like a good start. I would also add metrics like how many times the user requested a different version of the answer (where applicable) and the cost of the answer in tokens (to measure the cost efficiency of the engine).
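
For the token cost part, something like this rough sketch could work. It assumes each API response is a dict carrying a usage object with total_tokens (as the completions endpoint does), and the price per 1K tokens is a placeholder, not a real rate:

```python
# Rough sketch: token and dollar cost of one answered question across all
# attempts (retries/regenerations). Assumes each response dict carries
# usage.total_tokens; the price below is a placeholder, not a real rate.
PRICE_PER_1K_TOKENS = 0.02

def answer_cost(attempts: list[dict]) -> dict:
    tokens = sum(a["usage"]["total_tokens"] for a in attempts)
    return {
        "attempts": len(attempts),
        "tokens": tokens,
        "usd": tokens * PRICE_PER_1K_TOKENS / 1000,
    }
```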

Also, if possible, it makes sense to rate overall human/bot session efficiency (rough sketch after the example below):

How many questions did the human have to ask to get the result they needed for a standard task?
How many tokens were used to answer all of the human’s questions for the given task?
How long did it take to complete the task?

Example of a standard task to complete:

What is the best restaurant not far from the Eiffel Tower, how do I book a table there for tonight, and how do I get there by 8 pm from where I am now?
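
If it helps, here’s a sketch of the session-level bookkeeping I have in mind, assuming you log each question/answer turn with its token count (all names are illustrative):

```python
# Sketch of per-task session metrics: questions asked, total tokens, and
# elapsed time. Assumes each turn is logged with its token count.
import time

class TaskSession:
    def __init__(self, task: str):
        self.task = task
        self.started = time.monotonic()
        self.turns: list[dict] = []

    def log_turn(self, question: str, answer: str, tokens: int) -> None:
        self.turns.append({"question": question, "answer": answer, "tokens": tokens})

    def summary(self) -> dict:
        return {
            "task": self.task,
            "questions_asked": len(self.turns),
            "total_tokens": sum(t["tokens"] for t in self.turns),
            "elapsed_seconds": round(time.monotonic() - self.started, 1),
        }

session = TaskSession("Book a table at the best restaurant near the Eiffel Tower for tonight")
session.log_turn("What is the best restaurant near the Eiffel Tower?", "...", tokens=180)
print(session.summary())
```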

Hello Peter, while there’s no single, universal metric for accuracy, I think your approach is really great, and I could see it being valuable for many teams.