Evaluating LLM Chat Responses without an Evaluation Dataset?

It really boils down to what you expect.

You can easily give the essence of the “answer” to an LLM for grading. It doesn’t need to be exact. All that matters is that the meaning of the response roughly matches what you’re expecting.
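Here is a rough sketch of what that could look like in Python. Everything in it is an assumption for illustration: the prompt wording, the PASS/FAIL convention, and the `call_llm` parameter, which stands in for whatever function you already use to send a prompt to your model and get text back.

```python
from typing import Callable

# Hypothetical grading prompt; adjust the wording to your own use case.
GRADER_TEMPLATE = """You are grading a chat response.

Question:
{question}

Expected gist (the response does not need to match this wording):
{gist}

Response to grade:
{response}

Does the response capture the expected gist?
Reply with PASS or FAIL on the first line, then one sentence of reasoning."""


def build_grader_prompt(question: str, gist: str, response: str) -> str:
    """Fill the template with the question, the expected gist, and the response."""
    return GRADER_TEMPLATE.format(question=question, gist=gist, response=response)


def grade(
    question: str,
    gist: str,
    response: str,
    call_llm: Callable[[str], str],  # assumed: takes a prompt, returns the model's text
) -> bool:
    """Return True if the grader's first line starts with PASS, else False."""
    reply = call_llm(build_grader_prompt(question, gist, response)).strip()
    if not reply:
        # Treat an empty grader reply as a failure rather than guessing.
        return False
    first_line = reply.splitlines()[0].strip().upper()
    return first_line.startswith("PASS")
```

You would call it with something like `grade("What is the capital of France?", "Names Paris as the capital", chat_response, call_llm=my_model_call)`, where `my_model_call` is your own wrapper. Since the gist is only a loose description of what you expect, you can write one per test question by hand without building a full evaluation dataset.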
