Hey all,
I was wondering what the best way of evaluating LLM chat responses is when you don’t have an evaluation dataset with “expected” outputs. Chat responses are non-deterministic and therefore don’t have fixed/predictable outputs, and benchmarks such as MMLU/HELM often can’t distinguish how well a chatbot is performing, as noted in the LLM-as-a-Judge paper.
Would greatly appreciate any insights on how people evaluate these LLM responses without a defined eval test suite, thanks!
It really boils down to what you expect.
You can easily give the essence of the “answer” to an LLM for grading. It doesn’t need to be exact; all that matters is that the semantics of the response capture what you’re expecting.
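A minimal sketch of that idea, assuming the OpenAI Python client and a hypothetical judge model; the prompt wording and the 1–5 scale are just placeholders you'd tune for your own use case:

```python
# Sketch of LLM-as-a-judge grading: give the judge the "essence" of the
# expected answer plus the chatbot's actual response, and ask for a score.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_response(question: str, answer_essence: str, chatbot_response: str) -> str:
    judge_prompt = (
        "You are grading a chatbot answer. The exact wording does not matter; "
        "only whether the semantics capture the expected essence.\n\n"
        f"Question: {question}\n"
        f"Expected essence: {answer_essence}\n"
        f"Chatbot answer: {chatbot_response}\n\n"
        "Give a score from 1 (misses the essence) to 5 (fully captures it), "
        "then a one-sentence justification."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any capable judge model works
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return result.choices[0].message.content

print(grade_response(
    "What causes tides?",
    "Mainly the Moon's gravity, plus the Sun's to a lesser extent.",
    "Tides happen because the Moon's gravitational pull tugs on Earth's oceans.",
))
```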
The only way I could think of was random input, in keeping with your context.
I just query mine with the most obtuse things I can think of and note the responses. There are also random word generators that can pipe into your prompt, for instance Crunch, found at crunch - wordlist generator download | SourceForge.net.
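A rough sketch of that approach, assuming you feed it a wordlist (e.g. one generated by Crunch) and a function that calls your chatbot; the file names and the prompt template are just assumptions for illustration:

```python
# Build obtuse prompts from a wordlist and log the responses for manual review.
import csv
import random

def load_words(path: str = "crunch_output.txt") -> list[str]:
    try:
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        # fall back to a tiny built-in vocabulary so the sketch still runs
        return ["quasar", "teaspoon", "ledger", "permafrost", "oboe", "tariff"]

def random_prompt(words: list[str], n: int = 4) -> str:
    return "Explain the connection between " + ", ".join(random.sample(words, n)) + "."

def fuzz_chatbot(ask, rounds: int = 10, out_path: str = "fuzz_log.csv") -> None:
    """`ask` is whatever function sends a prompt to your chatbot and returns its reply."""
    words = load_words()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "response"])
        for _ in range(rounds):
            prompt = random_prompt(words)
            writer.writerow([prompt, ask(prompt)])

# Example with a stand-in chatbot; swap in your real client call.
fuzz_chatbot(lambda p: f"(stubbed reply to: {p})", rounds=3)
```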