How to evaluate chat conversations (not just question-answer pairs)

piotrbrodka95 · December 8, 2023, 3:41pm

I’ve watched The New Stack and Ops for AI video
(https://www.youtube.com/watch?v=XGJNo8TpuVA)

They explained how to evaluate models, for example to score outputs using GPT4. They are using it to evaluate (question, outputs) pair.

But how to evaluate conversation? I know that conversation is basically a list of pairs (question, answer), but it is not exaclty the same, because there are relations between them (answer for question x may contain some info from previous questions).

Do you know methods, that are good here for chat conversation evaluation?

linda.j · February 9, 2024, 6:22pm

We are building a simulation testing platform specially designed to test and evaluation dynamtic conversation. Would love to learn more about your use cases and exchange ideas.

SomebodySysop · February 10, 2024, 10:31am

I’m still in testing mode, but I’ve got every question and response generated by model over the past 7 months, most ranked (evaluated) by the AI itself. I think I mostly used gpt-3.5 for ranking. I could export all of this into a spreadsheet (the questions and answers are in one cell, separated as “Question:” and “Response”.

But I’m curious: When you say “evaluate the conversations”, what are we looking for? In my case, the overall poorest results have been from gpt-3.5-turbo-16k – hands down. The best from the gpt-4 and gpt-4.5 models.

linda.j · February 12, 2024, 1:12pm

Can I ask what is your uss cases of yourAI? Do you evaluate dynamic input (user’s input changes dynamically based on AI’s output) in a natural conversion?

SomebodySysop · February 12, 2024, 1:34pm

3 Knowledge bases: Entertainment industry labor contracts, CA real estate law and religious documents (4 bibles, 3 Talmuds and 2 Tanakhs). I maintain a log of all queries and responses. Some these will be running conversations.

I keep this data in order to do fine-tuning down the line. And, of course, for users to be able to download their conversations for whatever uses they might have.

Never really thought much about evaluation other than whether they are getting consistently good or bad answers.

SomebodySysop · February 15, 2024, 7:22am

So, are we talking about something like this? As I said, I record the query/response pairs. Here is an LLM generated list of the last 10 queries from unique users, along with an LLM generated analysis of those queries based upon my question:

Topic		Replies	Views
Evaluating LLM Chat Responses without Evaluation Dataset? API gpt-4 , assistants-api	2	701	June 14, 2024
Need human like response to test the model performance API	3	1527	November 29, 2023
How to know whether GPT I built on GPT store is working as expected? Community gpt-4	3	983	February 14, 2024
GPT powered learning solution API api	22	2549	February 22, 2026
Exploring AI's Capability to Evaluate Conversations: A Feasibility Inquiry Prompting gpt-4 , exams , prompt , qualification	2	1124	February 5, 2024

How to evaluate chat conversations (not just question-answer pairs)

Related topics