Hello OpenAI Developer Community,
Over the last few weeks, I have noticed a recurring theme in our discussions: many of us have been expressing concerns about an apparent decline in the performance of the GPT-4 chat model. While these observations are valuable, I believe that a more systematic, objective approach would be beneficial in assessing these claims.
The purpose of this post is to propose a rigorous framework for testing the different GPT models, whether GPT-4 in the chat interface or models accessed through the API. Rather than relying on subjective experiences or individual anecdotes, I think it would be more fruitful to devise a series of logical, mathematical, and programming tests that evaluate the capabilities of these models in a measurable way.
The tests I propose include:
- Logic Tests: Simple puzzles or problem-solving tasks to evaluate the model’s ability to use deductive reasoning.
- Mathematical Multiple-Choice Questions (MCQs): These can provide insight into the model’s numerical processing and mathematical reasoning abilities. Because each answer is simply right or wrong, MCQs are easy to score.
- Programming and Algorithmic Exercises: Code-writing tasks can evaluate the model’s understanding of programming logic and syntax, and its ability to solve problems algorithmically.
Ideally, for MCQs and programming tasks, we should not only look at the final answer but also ask the model to provide a step-by-step explanation of how it arrived at the solution. This will help us understand the “thought process” behind the model’s answers, giving us a glimpse into how it processes and manipulates information.
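To make the scoring idea concrete, here is a minimal sketch in Python of how an MCQ run could be graded while still asking for a step-by-step explanation. Everything named here is an assumption for illustration: `ask_model` is a hypothetical callable that sends a prompt to whichever model is under test and returns its reply as text, and the `ANSWER: <letter>` convention is just one possible way to make the final choice machine-readable.

```python
import re
from typing import Callable, Dict, Optional

# Assumed convention: the model ends its reply with a line like "ANSWER: c".
ANSWER_RE = re.compile(r"^\s*ANSWER\s*:\s*([a-dA-D])\b", re.MULTILINE)


def build_prompt(question: str) -> str:
    """Ask for step-by-step reasoning followed by a parseable final answer."""
    return (
        "Solve the following multiple-choice question. "
        "Explain your reasoning step by step, then finish with a line "
        "of the form 'ANSWER: <letter>'.\n\n" + question
    )


def extract_choice(reply: str) -> Optional[str]:
    """Return the last 'ANSWER: x' letter in the reply, lowercased, or None."""
    matches = ANSWER_RE.findall(reply)
    return matches[-1].lower() if matches else None


def score(
    questions: Dict[int, str],
    key: Dict[int, str],
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> float:
    """Fraction of questions whose extracted choice matches the answer key."""
    correct = 0
    for number, text in questions.items():
        reply = ask_model(build_prompt(text))
        if extract_choice(reply) == key[number].lower():
            correct += 1
    return correct / len(questions)
```

Passing the model in as a plain callable keeps the grader independent of any particular client library, so the same test set could be run against the chat model (by pasting replies in by hand) or against an API model. The full replies can be stored alongside the score so the explanations remain available for review.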
To provide a concrete example of the kind of testing I propose, I’ve prepared an MCQ test from the 2023 French Baccalaureate mathematics exam. It is a probability and statistics problem, which I believe is quite relevant to the tasks we expect our AI models to perform.
The test is as follows:
"Exercise 1 (5 points)
This exercise is a multiple choice questionnaire.
For each question, only one of the four proposed answers is correct. The candidate will indicate on their answer sheet the number of the question and the chosen answer. No justification is required.
No points are deducted for a missing or incorrect answer.
A video game has a large online player community. Before starting a game,
the player must choose between two “worlds”: either world A, or world B.
A person is chosen at random in the community of players.
When playing a game, we assume that:
• the probability that the player chooses world A is 2/5;
• if the player chooses world A, the probability that they win the game is 7/10;
• the probability that the player wins the game is 12/25.
We consider the following events:
• A: “The player chooses world A”;
• B: “The player chooses world B”;
• G: “The player wins the game”.
1. The probability that the player chooses world A and wins the game is equal to:
a. 7/10 b. 3/25 c. 7/25 d. 24/125
2. The probability P_B(G) of event G given that B is realized is equal to:
a. 1/5 b. 1/3 c. 7/15 d. 5/12
In the rest of the exercise, a player plays 10 consecutive games. This situation is treated as random sampling with replacement. Recall that the probability of winning a game is 12/25.
3. The probability, rounded to the nearest thousandth, that the player wins exactly 6 games is equal to:
a. 0.859 b. 0.671 c. 0.188 d. 0.187
4. We consider a natural number n for which the probability, rounded to the nearest thousandth, that the player wins at most n games is 0.207. Then:
a. n = 2 b. n = 3 c. n = 4 d. n = 5
5. The probability that the player wins at least one game is equal to:
a. 1 - (12/25)^10 b. (13/25)^10 c. (12/25)^10 d. 1 - (13/25)^10
"
The correct answers are:
1. c
2. b
3. c
4. b
5. d
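For anyone who wants to check the answer key independently, the five results can be recomputed with a short Python script. This is only a sketch: it assumes the number of wins in 10 games follows a binomial distribution with parameters 10 and 12/25, which is how I read the "with replacement" statement in the exercise. The variable names are mine.

```python
from fractions import Fraction
from math import comb

# Data given in the exercise.
p_a = Fraction(2, 5)           # P(A): the player chooses world A
p_g_given_a = Fraction(7, 10)  # P(G | A): wins given world A
p_g = Fraction(12, 25)         # P(G): wins overall

# Question 1: P(A and G) = P(A) * P(G | A).
p_a_and_g = p_a * p_g_given_a

# Question 2: total probability gives P(G) = P(A and G) + P(B) * P(G | B),
# so P(G | B) = (P(G) - P(A and G)) / P(B), with P(B) = 1 - P(A).
p_g_given_b = (p_g - p_a_and_g) / (1 - p_a)


def binom_pmf(k: int, n: int = 10, p: Fraction = p_g) -> Fraction:
    """Exact probability of k wins in n independent games."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(m: int) -> Fraction:
    """Exact probability of at most m wins in 10 games."""
    return sum(binom_pmf(k) for k in range(m + 1))


# Question 3: exactly 6 wins, rounded to the nearest thousandth.
p_exactly_6 = round(float(binom_pmf(6)), 3)

# Question 4: every n whose cumulative probability rounds to 0.207.
n_match = [n for n in range(11) if round(float(binom_cdf(n)), 3) == 0.207]

# Question 5: at least one win = 1 - P(no win at all).
p_at_least_one = 1 - (1 - p_g) ** 10

print(p_a_and_g)    # 7/25  -> answer c
print(p_g_given_b)  # 1/3   -> answer b
print(p_exactly_6)  # 0.188 -> answer c
print(n_match)      # [3]   -> answer b
print(p_at_least_one == 1 - Fraction(13, 25) ** 10)  # True -> answer d
```

Using `Fraction` keeps questions 1, 2, and 5 exact, so the comparison with the proposed options does not depend on floating-point rounding.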
Implementing a structured and factual testing system like this can help us track the progress (or regression, if applicable) of these models over time. This would give us tangible evidence and a clearer picture of the models’ capabilities and their evolution.
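As a rough illustration of what tracking over time could look like, the sketch below appends each test run to a JSON Lines file and reports the change in score between the first and latest run for a given model. The file layout, the field names, and the functions `record_run`, `load_history`, and `score_change` are all assumptions of mine, not an existing tool.

```python
import json
from datetime import date
from pathlib import Path
from typing import List, Optional


def record_run(log_path, model_name: str, score: float,
               run_date: Optional[date] = None) -> dict:
    """Append one test run (date, model, score) as a line of JSON."""
    entry = {
        "date": (run_date or date.today()).isoformat(),
        "model": model_name,
        "score": score,
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def load_history(log_path, model_name: str) -> List[dict]:
    """Return all recorded runs for one model, oldest first."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        entries = [json.loads(line) for line in handle if line.strip()]
    # ISO dates sort correctly as plain strings.
    return sorted(
        (e for e in entries if e["model"] == model_name),
        key=lambda e: e["date"],
    )


def score_change(history: List[dict]) -> Optional[float]:
    """Latest score minus earliest score, or None with fewer than two runs."""
    if len(history) < 2:
        return None
    return history[-1]["score"] - history[0]["score"]
```

A negative value from `score_change` would point to a regression on that test set, though a single five-question exam is far too small to draw conclusions from; the same questions would need to be run several times, and many more test sets added, before the trend means much.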
I welcome your thoughts and feedback on this proposal. It’s essential that we work together to ensure our tools are improving and adapting to our needs in the best possible manner.
Best,
Joris Blestel Villaseque