There are four basic tests they’ve done:
- Solving math problems
- Answering “sensitive” questions
- Code generation
- Visual reasoning
Let’s dispense with them, one-by-one.
Math Problems
I’ve enumerated problems with this test already elsewhere here. But, here’s another source which treads the same ground,
Sensitive questions
If you actually read through the “sensitive” questions, you’ll quickly see they’re just bigoted, deeply offensive questions—of course they’re being filtered.
Code generation
Their entire test was whether the whole response was code that could be run as-is. The problem is that gpt-3.5 and gpt-4 are models which have been fine-tuned for chat. They’re chatty and getting more so. With the newer models, the authors counted code generation as a fail if the response was formatted with markdown. So,
def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5
would pass, but
```
def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5
```
would not.
When the requested code is extracted from within the fencing, the newer models outperform the earlier models.
Source:
Deceptive definition of "directly executable" code · Issue #3 · lchen001/LLMDrift · GitHub
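To make the difference concrete, here is a rough sketch of what "extract the code from the fencing first" could look like. This is not the authors' harness or the script from the linked issue; the names `extract_code` and `is_directly_executable` are made up for illustration, and I'm using Python's built-in `compile()` as a stand-in for "can be run as-is" (it only checks that the text parses, it doesn't run anything).

```python
import re

# Build the fence marker programmatically so this snippet has no literal fence lines.
FENCE = "`" * 3

# Hypothetical pattern: opening fence, optional info string (e.g. "python"),
# a newline, then the shortest body up to the next fence.
FENCE_RE = re.compile(re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL)


def extract_code(response: str) -> str:
    """Return the contents of any fenced blocks, or the response unchanged if there are none."""
    blocks = FENCE_RE.findall(response)
    return "\n".join(blocks) if blocks else response


def is_directly_executable(code: str) -> bool:
    """Stand-in check (an assumption): does the text at least parse as Python?"""
    try:
        compile(code, "<response>", "exec")
    except SyntaxError:
        return False
    return True


body = (
    "def pythagoras(a, b):\n"
    "    if a < 0 or b < 0:\n"
    "        return None\n"
    "    else:\n"
    "        return (a**2 + b**2)**0.5\n"
)
fenced_response = FENCE + "python\n" + body + FENCE

print(is_directly_executable(fenced_response))                # False: the fence itself is not Python
print(is_directly_executable(extract_code(fenced_response)))  # True: same code, fence stripped
```

Same function both times. The only thing that changes the verdict is whether the grader strips the markdown before checking.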
Visual reasoning
The authors claim this metric has improved.
So, we have one deeply flawed test, one which is exactly as we should expect, one where they choose to measure something incredibly minute and not reflective of what they claim, and one where the model improved based on their tests.
That’s not strong evidence in support of a decline in model quality or capability.
Beyond that, this paper is a pre-print. That’s fine, lots of researchers publish pre-prints; it just means it has not undergone peer review. After reading this paper as a researcher myself, my conclusion is that it would never be published in any reputable journal, because it lacks rigor and the tests the authors conducted fail to isolate the effects they claim to be measuring.
If this were submitted as a project paper by a student in any class I TA, it would be a 6/10 if I were in a generous mood.