ChatGPT findings by Stanford researchers

They ran four basic tests:

  1. Solving math problems
  2. Answering “sensitive” questions
  3. Code generation
  4. Visual reasoning

Let’s dispense with them one by one.

Math problems

I’ve already enumerated the problems with this test elsewhere in this thread, but here’s another source that treads the same ground:

Sensitive questions

If you actually read through the “sensitive” questions, you’ll quickly see they’re just bigoted, deeply offensive prompts. Of course they’re being filtered.

Code generation

Their entire test was whether the response, as a whole, was code that could be run as-is. The problem with that is that gpt-3.5 and gpt-4 are models fine-tuned for chat. They’re chatty and getting chattier. With the newer models, the authors counted a code-generation response as a fail if it was wrapped in markdown fencing. So,

def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5

would pass, but

```
def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5
```

would not.

When the requested code is extracted from within the fencing, the newer models outperform the earlier models.

Source: “Deceptive definition of ‘directly executable’ code”, Issue #3 on lchen001/LLMDrift (GitHub)
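
For what it’s worth, stripping the fencing is trivial. Here’s a minimal sketch (my own illustration, not the authors’ harness or the code from the linked issue) of extracting the code before checking whether it runs. The FENCE indirection just avoids putting three literal backticks inside this post’s own code block:

```
import re

# Build the fence marker indirectly so this snippet doesn't itself
# contain three literal backticks, which would break the forum's formatting.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)

def extract_code(response: str) -> str:
    """Return the body of the first fenced block if the response is
    wrapped in markdown fencing; otherwise return the response as-is."""
    match = FENCED_BLOCK.search(response)
    return match.group(1) if match else response

def parses_as_python(code: str) -> bool:
    """A rough stand-in for the paper's "directly executable" criterion:
    does the text at least compile as Python?"""
    try:
        compile(code, "<response>", "exec")
        return True
    except SyntaxError:
        return False
```

Run both pythagoras responses above through extract_code first and they pass the exact same check; the fencing was a formatting difference, not a capability difference.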

Visual reasoning

The authors claim this metric has improved.

So, we have one deeply flawed test, one result that is exactly what we should expect, one where they chose to measure something incredibly minute that doesn’t reflect what they claim, and one where the model improved by their own measurements.

That’s not strong evidence in support of a decline in model quality or capability.

Beyond that, this paper is a pre-print (which is fine; lots of researchers publish pre-prints), which just means it has not undergone peer review. Having read the paper as a researcher myself, my conclusion is that it would never be published in any reputable journal: it lacks rigor, and the tests it conducts fail to isolate the effects the authors claim to be measuring.

If this were submitted as a project paper by a student in any class I TA, it would be a 6/10 if I were in a generous mood.
