Hello OpenAI Developer Community,
Over the last few weeks, I have noticed a recurring theme in our discussions: many of us have expressed concern about an apparent decline in the performance of the GPT-4 chat model. While these observations are valuable, I believe a more systematic, objective approach would help in assessing these claims.
The purpose of this post is to propose a rigorous framework for testing the different GPT models, whether GPT-4 in the chat interface or the models accessed through the API. Rather than relying on subjective experiences or individual anecdotes, I think it would be more fruitful to devise a series of logic, mathematics, and programming tests that evaluate the capabilities of these models in a measurable way.
The tests I propose include:

Logic Tests: Simple puzzles or problem-solving tasks to evaluate the model’s ability to use deductive reasoning.

Mathematical Multiple-Choice Questions (MCQs): These can provide insight into the model’s numerical processing and mathematical reasoning abilities. Because each MCQ answer is simply right or wrong, these tests are easy to score.

Programming and Algorithmic Exercises: Code-writing tasks can evaluate the model’s understanding of programming logic and syntax, and its ability to solve problems algorithmically.
Ideally, for MCQs and programming tasks, we should not only look at the final answer but also ask the model to provide a step-by-step explanation of how it arrived at the solution. This will help us understand the “thought process” behind the model’s answers, giving us a glimpse into how it processes and manipulates information.
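As a rough sketch of how the scoring could be automated, the Python snippet below pulls the chosen letters out of a model reply and compares them with an answer key. It assumes a convention that is not part of any existing tool: the prompt asks the model to close each question with a line such as "Answer 3: c". The names `ANSWER_LINE`, `extract_answers`, and `score` are illustrative only.

```python
import re

# Assumed convention (hypothetical): the prompt asks the model to end each
# question with a line like "Answer 3: c". Everything else in the reply,
# including the step-by-step explanation, is ignored by the grader.
ANSWER_LINE = re.compile(r"answer\s*(\d+)\s*[:=]\s*([a-d])\b", re.IGNORECASE)


def extract_answers(reply: str) -> dict[int, str]:
    """Map question number -> chosen letter (lowercase).

    If a question is answered more than once, the last occurrence wins.
    """
    return {int(num): letter.lower() for num, letter in ANSWER_LINE.findall(reply)}


def score(reply: str, key: dict[int, str]) -> float:
    """Fraction of questions answered correctly; missing answers count as wrong."""
    chosen = extract_answers(reply)
    correct = sum(1 for question, letter in key.items() if chosen.get(question) == letter)
    return correct / len(key)
```

Keeping the explanation and the final letter separate in this way lets the score stay fully objective, while the explanation remains available for manual review.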
To provide a concrete example of the kind of testing I propose, I’ve prepared an MCQ test taken from the 2023 French Baccalaureate mathematics exam. It is a probability and statistics problem, which I believe is quite relevant to the tasks we expect our AI models to perform.
The test is as follows:
"Exercise 1 (5 points)
This exercise is a multiple-choice questionnaire.
For each question, only one of the four proposed answers is correct. The candidate will indicate on their paper the number of the question and the chosen answer. No justification is required.
No points are deducted for a missing or an incorrect answer.
A video game has a large online player community. Before starting a game, the player must choose between two “worlds”: either world A or world B.
A person is chosen at random from the community of players.
When they play a game, we assume that:
• the probability that the player chooses world A is 2/5;
• if the player chooses world A, the probability that they win the game is 7/10;
• the probability that the player wins the game is 12/25.
We consider the following events:
• A: “The player chooses world A”;
• B: “The player chooses world B”;
• G: “The player wins the game”.

1. The probability that the player chooses world A and wins the game is equal to:
a. 7/10 b. 3/25 c. 7/25 d. 24/125
2. The probability P_B(G) of event G given that B is realized is equal to:
a. 1/5 b. 1/3 c. 7/15 d. 5/12
In the rest of the exercise, a player plays 10 consecutive games. This situation is treated as a random draw with replacement. Recall that the probability of winning a game is 12/25.
3. The probability, rounded to the nearest thousandth, that the player wins exactly 6 games is equal to:
a. 0.859 b. 0.671 c. 0.188 d. 0.187
4. We consider a natural number n for which the probability, rounded to the nearest thousandth, that the player wins at most n games is 0.207. Then:
a. n = 2 b. n = 3 c. n = 4 d. n = 5
5. The probability that the player wins at least one game is equal to:
a. 1 - (12/25)^10
b. (13/25)^10
c. (12/25)^10
d. 1 - (13/25)^10
"
The correct answers are:
1. c
2. b
3. c
4. b
5. d
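For anyone who wants to double-check the answer key, here is a minimal Python sketch that recomputes each answer from the values given in the exercise. The variable and function names are illustrative, and it assumes the 10 games are independent, so that the number of wins follows a binomial distribution with n = 10 and p = 12/25.

```python
from fractions import Fraction
from math import comb

p_a = Fraction(2, 5)            # P(A): the player chooses world A
p_g_given_a = Fraction(7, 10)   # P(G | A): win given world A
p_g = Fraction(12, 25)          # P(G): overall probability of winning

# Question 1: P(A and G) = P(A) * P(G | A)
q1 = p_a * p_g_given_a

# Question 2: law of total probability,
# P(G) = P(A and G) + P(B) * P(G | B), solved for P(G | B) with P(B) = 1 - P(A)
q2 = (p_g - q1) / (1 - p_a)


def binom_pmf(k: int, n: int = 10, p: Fraction = p_g) -> Fraction:
    """Exact probability of exactly k wins in n independent games."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(m: int) -> Fraction:
    """Exact probability of at most m wins in 10 games."""
    return sum(binom_pmf(k) for k in range(m + 1))


# Question 3: exactly 6 wins, rounded to the nearest thousandth
q3 = round(float(binom_pmf(6)), 3)

# Question 4: the n whose cumulative probability rounds to 0.207
q4 = next(n for n in range(11) if round(float(binom_cdf(n)), 3) == 0.207)

# Question 5: at least one win = 1 - P(no win at all)
q5 = 1 - (1 - p_g) ** 10

print(q1, q2, q3, q4, q5 == 1 - Fraction(13, 25) ** 10)
```

Using `Fraction` keeps the first two answers exact, so they can be compared directly with the options 7/25 and 1/3 instead of with rounded decimals.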
Implementing a structured, fact-based testing system like this can help us track the progress (or regression, if applicable) of these models over time. This would give us tangible evidence and a clearer picture of the models’ capabilities and their evolution.
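One possible way to keep track of results over time is sketched below in Python. It is only an illustration under my own assumptions: the `Run` record, its fields, and the helper names are hypothetical, and how the runs are stored (a shared spreadsheet, a CSV file, a repository) is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Run:
    """One execution of the test set against one model (hypothetical record)."""
    model: str    # free-form label chosen by the tester
    day: date     # date the test was run
    correct: int  # number of questions answered correctly
    total: int    # number of questions asked

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def accuracy_trend(runs: list[Run], model: str) -> list[tuple[date, float]]:
    """Chronological (date, accuracy) pairs for a single model."""
    selected = sorted((r for r in runs if r.model == model), key=lambda r: r.day)
    return [(r.day, r.accuracy) for r in selected]


def accuracy_change(runs: list[Run], model: str) -> Optional[float]:
    """Latest accuracy minus earliest accuracy; None with fewer than two runs."""
    trend = accuracy_trend(runs, model)
    if len(trend) < 2:
        return None
    return trend[-1][1] - trend[0][1]
```

A negative value from `accuracy_change` would point to a drop between the first and the most recent run on the same questions. Since model output can vary from one run to the next, repeating each test several times before drawing conclusions seems advisable.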
I welcome your thoughts and feedback on this proposal. It’s essential that we work together to ensure our tools keep improving and adapting to our needs in the best possible manner.
Best,
Joris Blestel Villaseque