Proposing an Objective Testing Framework for GPT Models - Including a Sample Test 🚀

Hello OpenAI Developer Community,

Over the last few weeks, I have noticed a recurring theme in our discussions: many of us have been expressing concerns about an apparent decline in the performance of the GPT-4 chat model. While these observations are valuable, I believe that a more systematic, objective approach would be beneficial in assessing these claims.

The purpose of this post is to propose a rigorous framework for testing the different GPT models, whether it’s GPT-4 in ChatGPT or the models accessed through the API. Rather than relying on subjective experiences or individual anecdotes, I think it would be more fruitful to devise a series of logical, mathematical, and programming tests that evaluate the capabilities of these models in a measurable way.

The tests I propose include:

  1. Logic Tests: Simple puzzles or problem-solving tasks to evaluate the model’s ability to use deductive reasoning.

  2. Mathematical Multiple-Choice Questions (MCQs): These can provide insight into the model’s numerical processing and mathematical reasoning abilities. The direct, right-or-wrong nature of MCQs also makes these tests easy to score.

  3. Programming and Algorithmic Exercises: Code writing tasks can evaluate the model’s understanding of programming logic, syntax, and its ability to solve problems algorithmically.

Ideally, for MCQs and programming tasks, we should not only look at the final answer but also ask the model to provide a step-by-step explanation of how it arrived at the solution. This will help us understand the “thought process” behind the model’s answers, giving us a glimpse into how it processes and manipulates information.
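To make scoring reproducible rather than manual, here is a minimal sketch of what an automated harness could look like. It assumes the official `openai` Python client; the prompt wording, the `ask`/`extract_choice`/`score` helpers, and the “Answer: <letter>” convention are illustrative assumptions on my part, not a fixed standard:

```python
# Minimal sketch of an automated MCQ scoring harness.
# Assumes the official `openai` Python client; prompt wording and
# answer-extraction logic are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, question: str) -> str:
    """Send one MCQ to the model, asking for reasoning plus a final letter."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": question
            + "\n\nExplain your reasoning step by step, then finish with"
              " a line of the form 'Answer: <letter>'.",
        }],
        temperature=0,  # keep output as repeatable as the API allows
    )
    return response.choices[0].message.content

def extract_choice(reply: str) -> str | None:
    """Pull the final 'Answer: X' letter out of the model's reply."""
    match = re.search(r"Answer:\s*([a-dA-D])", reply)
    return match.group(1).lower() if match else None

def score(model: str, questions: list[str], key: list[str]) -> float:
    """Fraction of questions the model answers correctly."""
    correct = sum(
        extract_choice(ask(model, q)) == k.lower()
        for q, k in zip(questions, key)
    )
    return correct / len(key)
```

Setting `temperature=0` matters here: when we compare scores across dates, we want differences to come from the model, not from sampling randomness.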

To provide a concrete example of the kind of testing I propose, I’ve prepared an MCQ test from the 2023 French Baccalaureate mathematics exam. It is a problem based on probability and statistics, which I believe is quite relevant to the tasks we expect our AI models to perform.

The test is as follows:

"Exercise 1 (5 points)
This exercise is a multiple choice questionnaire.
For each question, only one of the four proposed answers is correct. The candidate
will indicate on his copy the number of the question and the chosen answer. No justification is
asked.
No point is removed in the absence of an answer or in case of incorrect answer.

A video game has a large online player community. Before starting a game,
the player must choose between two “worlds”: either world A or world B.
A player is chosen at random from the community.
For each game played, we assume that:
• the probability that the player chooses world A is 2/5;
• if the player chooses world A, the probability that they win the game is 7/10;
• the probability that the player wins the game is 12/25.

We consider the following events:
• A: “The player chooses world A”;
• B: “The player chooses world B”;
• G: “The player wins the game”.

  1. The probability that the player chooses world A and wins the game is equal to:
    a. 7/10   b. 3/25   c. 7/25   d. 24/125

  2. The probability P_B(G) of event G given that B is realized is equal to:
    a. 1/5   b. 1/3   c. 7/15   d. 5/12

In the rest of the exercise, a player plays 10 consecutive games. This situation is modeled as a random draw with replacement. Recall that the probability of winning a game is 12/25.

  3. The probability, rounded to the nearest thousandth, that the player wins exactly 6 games is equal to:
    a. 0.859   b. 0.671   c. 0.188   d. 0.187

  4. We consider a natural number n for which the probability, rounded to the nearest thousandth, that the player wins at most n games is 0.207. Then:
    a. n = 2   b. n = 3   c. n = 4   d. n = 5

  5. The probability that the player wins at least one game is equal to:
    a. 1 - (12/25)^10
    b. (13/25)^10
    c. (12/25)^10
    d. 1 - (13/25)^10"

The correct answers are:

  1. C
  2. B
  3. C
  4. B
  5. D
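For anyone who wants to double-check this key without trusting a model, all five answers follow from the law of total probability and the binomial distribution B(10, 12/25). Here is a short verification in Python, using exact fractions and only rounding at the end:

```python
# Verify the answer key with exact fractions and the binomial law B(10, 12/25).
from fractions import Fraction
from math import comb

pA, pG_given_A, pG = Fraction(2, 5), Fraction(7, 10), Fraction(12, 25)

# Q1: P(A and G) = P(A) * P(G|A)
print(pA * pG_given_A)                        # 7/25 -> answer c

# Q2: total probability: P(G) = P(A and G) + P(B) * P_B(G)
pB = 1 - pA
print((pG - pA * pG_given_A) / pB)            # 1/3 -> answer b

# Q3-Q5: X ~ B(10, 12/25), the number of games won out of 10
p, n = Fraction(12, 25), 10

def pmf(k: int) -> Fraction:
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(float(pmf(6)), 3))                # 0.188 -> answer c

# Q4: find n where P(X <= n) rounds to 0.207
def cdf(m: int) -> float:
    return float(sum(pmf(k) for k in range(m + 1)))

print([round(cdf(m), 3) for m in range(5)])   # 0.207 at m = 3 -> answer b

# Q5: P(X >= 1) = 1 - P(X = 0) = 1 - (13/25)^10
print(pmf(0) == Fraction(13, 25)**10)         # True -> answer d
```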

Implementing a structured, objective testing system like this can help us track the progress (or regression) of these models over time. This would give us tangible evidence and a clearer picture of the models’ capabilities and their evolution.

I welcome your thoughts and feedback on this proposal. It’s essential that we work together to ensure our tools are improving and adapting to our needs in the best possible manner.

Best,
Joris Blestel Villaseque

I’d like to share the preliminary results of the testing framework I proposed. It’s important to note that these results, though interesting, should not be seen as conclusive due to the limited number of tests conducted. For a more accurate depiction of performance, I believe that each model should be tested at least twenty times. This way, we can calculate an average score that provides a more reliable indication of trends.

Here are the initial results:

  1. GPT-4 (June 12 model):
  • Initial response accuracy: 3/5
  • Upon request for correction: 5/5 (excellent response)
  2. GPT-4 API:
  • First test: 4/5
  • Second test: 2/5
  3. Wolfram Plugin:
  • Accuracy: 4/5
  4. GPT-3.5:
  • Accuracy: 2/5

Based on these preliminary results, there doesn’t seem to be a significant discrepancy between the performance of the model used in ChatGPT and the GPT-4 model we access via the API. However, we must exercise caution, as the reliability of these results is not yet firmly established due to the small sample size.

For the next steps, I recommend conducting additional tests to confirm or adjust these preliminary results. A larger data set will allow for more meaningful and reliable analysis of the results.

I also wanted to emphasize the importance of using the API for testing. This ensures that we are working with a static version of the model that remains unchanged until OpenAI notifies us of updates. Please note that these tests were conducted prior to the API update to GPT-4-0613.
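Building on the hypothetical `score` helper from the harness I sketched earlier, the twenty-run averaging I suggested could look like this; the snapshot name and run count are illustrative, not prescriptive:

```python
# Sketch: average a model's score over repeated runs of the full test.
# Reuses the hypothetical score(model, questions, key) helper from the
# earlier harness sketch; snapshot name and run count are illustrative.
from statistics import mean, stdev

def averaged_score(model: str, questions: list[str], key: list[str],
                   runs: int = 20) -> tuple[float, float]:
    """Run the full test `runs` times; report the mean score and its spread."""
    scores = [score(model, questions, key) for _ in range(runs)]
    return mean(scores), stdev(scores)

# Pinning a dated snapshot such as "gpt-4-0613" keeps the model fixed
# between runs, so any score drift reflects sampling noise rather than
# silent server-side updates.
# Example (hypothetical): m, s = averaged_score("gpt-4-0613", qs, key)
```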

I look forward to your thoughts and continued participation in this initiative. Your input is invaluable as we continually strive for optimal performance from our AI models.

This is an interesting idea. I’m wondering if we can create more exams for these LLMs that are closer to individual use cases.

Yes, absolutely, that’s why I’m sharing. The idea would be to have a large standardized test so that we can evaluate the models as factually as possible!