I typically use GPT-4 when doing GPT-based evaluations of RAG answer quality. I have recently started experimenting with using GPT-4o as an evaluator, but the results seem inferior.
For example, here is an evaluation run of 200 questions for two GPT-based metrics that we call “groundedness” and “relevance”:
For each metric, the GPT evaluator must give the answer a score from 1 to 5, with 5 being the best.
For groundedness, GPT-4 averaged 4.98 while GPT-4o averaged 4.86.
For relevance, GPT-4 averaged 4.94 while GPT-4o averaged 4.57.
I did a spot check of the answers, and I tended to agree with GPT-4’s scores more than GPT-4o’s.
You can see the groundedness prompt here:
And the relevance prompt here:
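In case it helps, here’s roughly the shape of the harness those prompts plug into. This is a simplified sketch using the openai Python SDK, with a placeholder prompt string and a made-up `eval_set` structure (not the actual prompts or data above); it just shows how each evaluator model is asked for a 1–5 score and the scores are averaged.

```python
# Simplified sketch of the GPT-as-judge loop (openai Python SDK).
# PLACEHOLDER_PROMPT and eval_set are illustrative, not the real prompts/data.
from statistics import mean

from openai import OpenAI

client = OpenAI()

PLACEHOLDER_PROMPT = (
    "Rate the groundedness of the answer against the sources on a 1-5 scale, "
    "where 5 means every claim is supported. Reply with only the number.\n\n"
    "Sources:\n{context}\n\nAnswer:\n{answer}"
)


def judge(model: str, context: str, answer: str) -> int:
    """Ask the evaluator model for a single 1-5 score."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": PLACEHOLDER_PROMPT.format(context=context, answer=answer),
            }
        ],
    )
    return int(response.choices[0].message.content.strip())


def average_score(model: str, eval_set: list[dict]) -> float:
    """Average the 1-5 scores over the whole question set."""
    return mean(judge(model, q["context"], q["answer"]) for q in eval_set)


# e.g. compare the two evaluator models on the same 200-question set:
# print(average_score("gpt-4", eval_set), average_score("gpt-4o", eval_set))
```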
I’d love to hear other folks’ experiences with using GPT-4o for GPT-based evaluations. Thanks!
Good first post! Hope you stick around. We’ve got a lot of gems scattered about. We try to keep up with tagging and categorizing everything correctly, but with a forum this size, it’s quite the task!
Again, good to have you with us. And thanks for breaking out code in your first post!
It would be nice to see how much longer gpt-4 took compared to gpt-4o. If I recall correctly, gpt-4o is a quantization of gpt-4-turbo, but I could be wrong about this!
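Something like this would give a quick read on the latency difference (just a sketch, assuming the openai Python SDK and a throwaway placeholder prompt, not your actual evaluator prompts):

```python
# Quick-and-dirty latency comparison between the two evaluator models.
# The prompt is a throwaway placeholder; real timings depend on your prompts.
import time

from openai import OpenAI

client = OpenAI()
PROMPT = "Rate the groundedness of this answer from 1-5. Reply with only the number."


def timed_call(model: str) -> float:
    """Return wall-clock seconds for one chat completion with the given model."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return time.perf_counter() - start


for model in ("gpt-4", "gpt-4o"):
    print(model, f"{timed_call(model):.2f}s")
```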
I’ve posted a comment on a topic comparing both models. The chart I provided was published by OpenAI on GitHub; if I recall correctly, it was the simple-evals repository.