Worse results when using GPT-4o as an evaluator

I typically use GPT-4 when doing GPT-based evaluations of RAG answer quality. I have recently started experimenting with using GPT-4o as an evaluator, but the results seem inferior.

For example, here is an evaluation run of 200 questions for two GPT-based metrics that we call “groundedness” and “relevance”:

For each metric, the GPT evaluator must give a score from 1 to 5, with 5 being the best.

For groundedness, GPT-4 averaged 4.98 while GPT-4o averaged 4.86.
For relevance, GPT-4 averaged 4.94 while GPT-4o averaged 4.57.
I did a spot check of the answers, and I tended to agree with GPT-4 more than GPT-4o.

You can see the groundedness prompt here:

And the relevance prompt here:
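In case it helps picture the setup, here is a rough sketch of the kind of evaluation loop I mean. This is not my actual code, and the prompt below is only an illustrative placeholder rather than the groundedness/relevance prompts linked above; it assumes the openai Python SDK, and the `grade` and `average_score` helpers are made up for the example.

```python
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

# Illustrative placeholder only -- not the actual groundedness prompt linked above.
GROUNDEDNESS_PROMPT = """You are grading whether an ANSWER is grounded in the provided SOURCES.
Rate the answer from 1 to 5, where 5 means every claim is supported by the sources
and 1 means the answer is not supported at all. Reply with only the number.

SOURCES:
{sources}

ANSWER:
{answer}
"""


def grade(model: str, sources: str, answer: str) -> int:
    """Ask the evaluator model for a 1-5 score and parse the first digit it returns."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "user", "content": GROUNDEDNESS_PROMPT.format(sources=sources, answer=answer)}
        ],
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 1


def average_score(model: str, examples: list[dict]) -> float:
    """Average the 1-5 scores over the question set, e.g. for model in ("gpt-4", "gpt-4o")."""
    return mean(grade(model, ex["sources"], ex["answer"]) for ex in examples)
```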

I’d love to hear other folks’ experience with using GPT-4o for GPT-based evaluations. Thanks!


Welcome to the dev community forum!

Good first post! Hope you stick around. We’ve got a lot of gems scattered about. We try to keep up with tagging and categorizing everything correctly, but with a forum this size, it’s quite the task!

Again, good to have you with us. And thanks for breaking out code in your first post! 🙂

It would be nice to see how much longer gpt-4 took compared to gpt-4o. If I recall correctly, gpt-4o is a quantization of gpt-4-turbo, but I could be wrong about this!
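If you want to check, a minimal timing sketch like this would do it (assuming the openai Python SDK; the prompt and the `time_one_call` helper are just placeholders for illustration):

```python
import time

from openai import OpenAI

client = OpenAI()


def time_one_call(model: str, prompt: str) -> float:
    """Return wall-clock seconds for a single chat completion with the given model."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start


for model in ("gpt-4", "gpt-4o"):
    elapsed = time_one_call(model, "Rate this answer from 1 to 5: ...")
    print(f"{model}: {elapsed:.2f} seconds")
```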

I’ve posted a comment on a topic comparing both models. The chart I provided was published by OpenAI on GitHub; if I recall correctly, it was in the simple-evals repository.
