Hey all,
I was wondering what the best way of evaluating LLM chat responses is when you don’t have an evaluation dataset with “expected” outputs. Chat responses are non-deterministic and therefore don’t have fixed/predictable outputs, and benchmarks such as MMLU/HELM often can’t distinguish how well a chatbot is performing, as noted in the LLM-as-a-Judge paper.
Would greatly appreciate any insights on how people evaluate these LLM responses without a defined eval test suite, thanks!
It really boils down to what you expect.
You can easily give the essence of the “answer” to an LLM for grading. It doesn’t need to be exact; all that matters is that the semantics of the response capture what you’re expecting.
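A minimal sketch of that idea, assuming the OpenAI Python client and a hypothetical judge model; the prompt wording and the 1–5 scale are just placeholders you'd tune for your own use case:

```python
# Sketch of LLM-as-a-judge grading: give the judge the "essence" of the
# expected answer plus the chatbot's actual response, and ask for a score.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_response(question: str, answer_essence: str, chatbot_response: str) -> str:
    judge_prompt = (
        "You are grading a chatbot answer. The exact wording does not matter; "
        "only whether the semantics capture the expected essence.\n\n"
        f"Question: {question}\n"
        f"Expected essence: {answer_essence}\n"
        f"Chatbot answer: {chatbot_response}\n\n"
        "Give a score from 1 (misses the essence) to 5 (fully captures it), "
        "then a one-sentence justification."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any capable judge model works
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return result.choices[0].message.content

print(grade_response(
    "What causes tides?",
    "Mainly the Moon's gravity, plus the Sun's to a lesser extent.",
    "Tides happen because the Moon's gravitational pull tugs on Earth's oceans.",
))
```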
The only way I could think of was random input, in keeping with your context.
I just query mine with the most obtuse things I can think of and note the responses. There are also random word generators that can pipe into your prompt, for instance Crunch, found at crunch - wordlist generator download | SourceForge.net.
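A rough sketch of that approach, assuming you feed it a wordlist (e.g. one generated by Crunch) and a function that calls your chatbot; the file names and the prompt template are just assumptions for illustration:

```python
# Build obtuse prompts from a wordlist and log the responses for manual review.
import csv
import random

def load_words(path: str = "crunch_output.txt") -> list[str]:
    try:
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        # fall back to a tiny built-in vocabulary so the sketch still runs
        return ["quasar", "teaspoon", "ledger", "permafrost", "oboe", "tariff"]

def random_prompt(words: list[str], n: int = 4) -> str:
    return "Explain the connection between " + ", ".join(random.sample(words, n)) + "."

def fuzz_chatbot(ask, rounds: int = 10, out_path: str = "fuzz_log.csv") -> None:
    """`ask` is whatever function sends a prompt to your chatbot and returns its reply."""
    words = load_words()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "response"])
        for _ in range(rounds):
            prompt = random_prompt(words)
            writer.writerow([prompt, ask(prompt)])

# Example with a stand-in chatbot; swap in your real client call.
fuzz_chatbot(lambda p: f"(stubbed reply to: {p})", rounds=3)
```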