How to evaluate chat conversations (not just question-answer pairs)

I’ve watched The New Stack and Ops for AI video
(https://www.youtube.com/watch?v=XGJNo8TpuVA)

They explained how to evaluate models, for example to score outputs using GPT4. They are using it to evaluate (question, outputs) pair.

But how to evaluate conversation? I know that conversation is basically a list of pairs (question, answer), but it is not exaclty the same, because there are relations between them (answer for question x may contain some info from previous questions).

Do you know methods, that are good here for chat conversation evaluation?

We are building a simulation testing platform specially designed to test and evaluation dynamtic conversation. Would love to learn more about your use cases and exchange ideas.

I’m still in testing mode, but I’ve got every question and response generated by model over the past 7 months, most ranked (evaluated) by the AI itself. I think I mostly used gpt-3.5 for ranking. I could export all of this into a spreadsheet (the questions and answers are in one cell, separated as “Question:” and “Response”.

But I’m curious: When you say “evaluate the conversations”, what are we looking for? In my case, the overall poorest results have been from gpt-3.5-turbo-16k – hands down. The best from the gpt-4 and gpt-4.5 models.

Can I ask what is your uss cases of yourAI? Do you evaluate dynamic input (user’s input changes dynamically based on AI’s output) in a natural conversion?

3 Knowledge bases: Entertainment industry labor contracts, CA real estate law and religious documents (4 bibles, 3 Talmuds and 2 Tanakhs). I maintain a log of all queries and responses. Some these will be running conversations.

I keep this data in order to do fine-tuning down the line. And, of course, for users to be able to download their conversations for whatever uses they might have.

Never really thought much about evaluation other than whether they are getting consistently good or bad answers.

So, are we talking about something like this? As I said, I record the query/response pairs. Here is an LLM generated list of the last 10 queries from unique users, along with an LLM generated analysis of those queries based upon my question: