Worse results when using GPT-4o as an evaluator

I typically use GPT-4 when doing GPT-based evaluations of RAG answer quality. I have recently started experimenting with using GPT-4o as an evaluator, but the results seem inferior.

For example, here is an evaluation run of 200 questions for two GPT-based metrics that we call “groundedness” and “relevance”:

For each metric, the GPT evaluator must give a score from 1 to 5, with 5 being the best.

For groundedness, GPT-4 averaged 4.98 while GPT-4o averaged 4.86.
For relevance, GPT-4 averaged 4.94 while GPT-4o averaged 4.57.
I did a spot check of the answers, and I tended to agree with GPT-4 more than GPT-4o.

You can see the groundedness prompt here:

And the relevance prompt here:
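In case it helps picture the setup, here is a rough sketch of the kind of evaluation loop I mean. This is not my actual code, and the prompt below is only an illustrative placeholder rather than the groundedness/relevance prompts linked above; it assumes the openai Python SDK, and the `grade` and `average_score` helpers are made up for the example.

```python
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

# Illustrative placeholder only -- not the actual groundedness prompt linked above.
GROUNDEDNESS_PROMPT = """You are grading whether an ANSWER is grounded in the provided SOURCES.
Rate the answer from 1 to 5, where 5 means every claim is supported by the sources
and 1 means the answer is not supported at all. Reply with only the number.

SOURCES:
{sources}

ANSWER:
{answer}
"""


def grade(model: str, sources: str, answer: str) -> int:
    """Ask the evaluator model for a 1-5 score and parse the first digit it returns."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "user", "content": GROUNDEDNESS_PROMPT.format(sources=sources, answer=answer)}
        ],
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 1


def average_score(model: str, examples: list[dict]) -> float:
    """Average the 1-5 scores over the question set, e.g. for model in ("gpt-4", "gpt-4o")."""
    return mean(grade(model, ex["sources"], ex["answer"]) for ex in examples)
```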

I’d love to hear other folks’ experience with using GPT-4o for GPT-based evaluations. Thanks!


Welcome to the dev community forum!

Good first post! Hope you stick around. We’ve got a lot of gems scattered about. We try to keep up with tagging and categorizing everything correctly, but with a forum this size, it’s quite the task!

Again, good to have you with us. And thanks for breaking out code in your first post! 🙂

It would be nice to see how much longer gpt-4 took compared to gpt-4o. If I recall correctly, gpt-4o is a quantization of gpt-4-turbo, but I could be wrong about this!
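If you want to check, a minimal timing sketch like this would do it (assuming the openai Python SDK; the prompt and the `time_one_call` helper are just placeholders for illustration):

```python
import time

from openai import OpenAI

client = OpenAI()


def time_one_call(model: str, prompt: str) -> float:
    """Return wall-clock seconds for a single chat completion with the given model."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start


for model in ("gpt-4", "gpt-4o"):
    elapsed = time_one_call(model, "Rate this answer from 1 to 5: ...")
    print(f"{model}: {elapsed:.2f} seconds")
```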

I’ve posted a comment on a topic comparing both models. The chart I provided was published by OpenAI on GitHub; if I recall correctly, it was in the simple-evals repository.
