ChatGPT findings by Stanford researchers

They ran four basic tests:

  1. Solving math problems
  2. Answering “sensitive” questions
  3. Code generation
  4. Visual reasoning

Let’s dispense with them one by one.

Math problems

I’ve already enumerated the problems with this test elsewhere in this thread, but here’s another source that treads the same ground:

Sensitive questions

If you actually read through the “sensitive” questions, you’ll quickly see they’re just bigoted, deeply offensive prompts. Of course they’re being filtered.

Code generation

Their entire test was whether the response, as a whole, was code that could be run as-is. The problem with that is that gpt-3.5 and gpt-4 are models fine-tuned for chat. They’re chatty and getting chattier. With the newer models, the authors counted a code-generation response as a fail if it was wrapped in markdown fencing. So,

def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5

would pass, but

```
def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5
```

would not.

When the requested code is extracted from within the fencing, the newer models outperform the earlier models.

Source: “Deceptive definition of ‘directly executable’ code”, Issue #3 on lchen001/LLMDrift (GitHub)
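
For what it’s worth, stripping the fencing is trivial. Here’s a minimal sketch (my own illustration, not the authors’ harness or the code from the linked issue) of extracting the code before checking whether it runs. The FENCE indirection just avoids putting three literal backticks inside this post’s own code block:

```
import re

# Build the fence marker indirectly so this snippet doesn't itself
# contain three literal backticks, which would break the forum's formatting.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)

def extract_code(response: str) -> str:
    """Return the body of the first fenced block if the response is
    wrapped in markdown fencing; otherwise return the response as-is."""
    match = FENCED_BLOCK.search(response)
    return match.group(1) if match else response

def parses_as_python(code: str) -> bool:
    """A rough stand-in for the paper's "directly executable" criterion:
    does the text at least compile as Python?"""
    try:
        compile(code, "<response>", "exec")
        return True
    except SyntaxError:
        return False
```

Run both pythagoras responses above through extract_code first and they pass the exact same check; the fencing was a formatting difference, not a capability difference.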

Visual reasoning

The authors claim this metric has improved.

So, we have one deeply flawed test, one result that is exactly what we should expect, one where they chose to measure something incredibly minute that doesn’t reflect what they claim, and one where the model improved by their own measurements.

That’s not strong evidence in support of a decline in model quality or capability.

Beyond that, this paper is a pre-print (which is fine; lots of researchers publish pre-prints), which just means it has not undergone peer review. Having read the paper as a researcher myself, my conclusion is that it would never be published in any reputable journal: it lacks rigor, and the tests it conducts fail to isolate the effects the authors claim to be measuring.

If this were submitted as a project paper by a student in any class I TA, it would be a 6/10 if I were in a generous mood.
