Metrics for evaluating the answers endpoint?

Hello,

Is there a standard set of KPIs or metrics to evaluate the reply given by the answers endpoint?

We’re going to start testing our deployment internally, and I’d like input from the team on each answer. I came up with the following criteria, each rated from 1 to 5 per answer (a rough scoring sketch follows the list):

  • Does the answer make sense for the question asked? e.g. you ask about bakeries but the answer is about neighborhoods
  • Does the answer actually answer your question? e.g. you ask for the best bakery and the answer names a single bakery
  • Is there information in the answer that should not be there? e.g. you ask for the best bakery and the answer names one bakery but also includes info about neighborhoods
  • Is it a good answer to your question? e.g. you ask for the best restaurant and the answer should not be McDonald’s
  • Is the answer properly justified? e.g. you ask for the best restaurant and the answer gives the restaurant’s name along with a short explanation of why it is the best
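
For reference, here’s a minimal sketch of how I’m thinking of recording and aggregating those ratings. The criterion names and the `Review` structure are just my own placeholders, not anything from the API:

```python
# Minimal sketch for recording 1-5 ratings per answer and averaging them
# per criterion. Criterion names and the Review structure are placeholders.
from dataclasses import dataclass
from statistics import mean

CRITERIA = [
    "relevance",       # does the answer make sense for the question asked?
    "completeness",    # does it actually answer the question?
    "focus",           # is there information that should not be there?
    "quality",         # is it a good answer?
    "justification",   # is the answer properly justified?
]

@dataclass
class Review:
    question: str
    answer: str
    scores: dict[str, int]  # criterion -> rating from 1 to 5

def average_scores(reviews: list[Review]) -> dict[str, float]:
    """Mean rating per criterion across all reviewed answers."""
    return {c: mean(r.scores[c] for r in reviews) for c in CRITERIA}

reviews = [
    Review(
        question="What is the best bakery in Paris?",
        answer="Boulangerie X, because ...",
        scores={"relevance": 5, "completeness": 4, "focus": 5,
                "quality": 4, "justification": 3},
    ),
]
print(average_scores(reviews))
```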

Am I on the right track? Any input is highly appreciated! Thanks!

Looks like a good start. I would also add metrics like how many times the user requested a different version of the answer (where applicable) and the cost of the answer in tokens (to measure the cost efficiency of the engine).
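
For the token cost part, something like this rough sketch could work. It assumes each API response is a dict carrying a usage object with total_tokens (as the completions endpoint does), and the price per 1K tokens is a placeholder, not a real rate:

```python
# Rough sketch: token and dollar cost of one answered question across all
# attempts (retries/regenerations). Assumes each response dict carries
# usage.total_tokens; the price below is a placeholder, not a real rate.
PRICE_PER_1K_TOKENS = 0.02

def answer_cost(attempts: list[dict]) -> dict:
    tokens = sum(a["usage"]["total_tokens"] for a in attempts)
    return {
        "attempts": len(attempts),
        "tokens": tokens,
        "usd": tokens * PRICE_PER_1K_TOKENS / 1000,
    }
```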

Also, if possible, it makes sense to rate overall human/bot session efficiency (rough sketch after the example below):

How many questions did the human have to ask to get the result they needed for a standard task?
How many tokens were used to answer all of the human’s questions for the given task?
How long did it take to complete the task?

Example of a standard task to complete:

What is the best restaurant not far from the Eiffel Tower, how do I book a table there for tonight, and how do I get there by 8 pm from where I am now?
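
If it helps, here’s a sketch of the session-level bookkeeping I have in mind, assuming you log each question/answer turn with its token count (all names are illustrative):

```python
# Sketch of per-task session metrics: questions asked, total tokens, and
# elapsed time. Assumes each turn is logged with its token count.
import time

class TaskSession:
    def __init__(self, task: str):
        self.task = task
        self.started = time.monotonic()
        self.turns: list[dict] = []

    def log_turn(self, question: str, answer: str, tokens: int) -> None:
        self.turns.append({"question": question, "answer": answer, "tokens": tokens})

    def summary(self) -> dict:
        return {
            "task": self.task,
            "questions_asked": len(self.turns),
            "total_tokens": sum(t["tokens"] for t in self.turns),
            "elapsed_seconds": round(time.monotonic() - self.started, 1),
        }

session = TaskSession("Book a table at the best restaurant near the Eiffel Tower for tonight")
session.log_turn("What is the best restaurant near the Eiffel Tower?", "...", tokens=180)
print(session.summary())
```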

Hello Peter, while there’s no single, universal metric for accuracy, I think your approach is really great, and I could see it being valuable for many teams.