ChatGPT finding by Stanford researchers

I am not convinced by the claim in the above article. I understand that model drift can happen if there are new patterns in the data, or if there is concept drift. But here, for a language model, the data is natural language, mostly English, and its linguistic patterns stay largely the same. What, then, could cause the model’s accuracy to degrade? It would be great to hear other perspectives.

There are four basic tests they’ve done:

  1. Solving math problems
  2. Answering “sensitive” questions
  3. Code generation
  4. Visual reasoning

Let’s dispense with them one by one.

Math Problems

I’ve already enumerated problems with this test elsewhere here, but here’s another source which treads the same ground:

Sensitive questions

If you actually read through the “sensitive” questions, you’ll quickly see they’re just bigoted, deeply offensive questions—of course they’re being filtered.

Code generation

Their entire test was whether the full response was code that could be run as-is. The problem is that gpt-3.5 and gpt-4 have been fine-tuned for chat; they’re chatty and getting more so. The newer models tend to wrap code in markdown fences, and the authors counted any markdown-formatted response as a failure. So,

def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5

Would pass but,

```
def pythagoras(a, b):
    if a < 0 or b < 0:
        return None
    else:
        return (a**2 + b**2)**0.5
```

would not.
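For context, here’s a minimal sketch of what a naive “directly executable” check might look like. The function name and the use of Python’s built-in compile() are my assumptions, not the authors’ actual harness:

```
def is_directly_executable(response: str) -> bool:
    """Treat the entire response as Python source and see whether it parses.

    A response wrapped in markdown code fences fails with a SyntaxError
    because of the backtick lines, even though the code inside is fine.
    """
    try:
        compile(response, "<llm-response>", "exec")
        return True
    except SyntaxError:
        return False
```

Under a check like this, the second response above is scored as a failure purely because of the backticks.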

When the requested code is extracted from within the fencing, the newer models outperform the earlier models.

Source:
Deceptive definition of "directly executable" code · Issue #3 · lchen001/LLMDrift · GitHub
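And a sketch of the fix the issue describes: pull the code out of the fence before checking it. The regex and the function name below are my own illustration, not code from the LLMDrift repo:

```
import re

def strip_markdown_fences(response: str) -> str:
    """Return the contents of the first fenced code block, if any;
    otherwise return the response unchanged.
    """
    match = re.search(r"```[\w+-]*[ \t]*\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response
```

Running is_directly_executable(strip_markdown_fences(response)) on the fenced example above then passes, which is how the issue arrives at the newer models outperforming the earlier ones.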

Visual reasoning

The authors claim this metric has improved.

So we have one deeply flawed test, one whose result is exactly what we should expect, one where they chose to measure something incredibly narrow that does not reflect what they claim, and one where the model improved by their own measure.

That’s not strong evidence in support of a decline in model quality or capability.

Beyond that, this paper is a pre-print (which is fine, lots of researchers publish pre-prints); it just means it has not undergone peer review. Having read it as a researcher myself, my conclusion is that it would never be published in any reputable journal: it lacks rigor, and the tests they conducted fail to isolate the effects they claim to be measuring.

If this were submitted as a project paper by a student in any class I TA, it would be a 6/10 if I were in a generous mood.
