GPT-4 has been severely downgraded (topic curation)

GPT-4 is getting worse over time, not better.

Many people have reported noticing a significant degradation in the quality of the model's responses, but until now it was all anecdotal.

But now we know.

At least one study shows how the June version of GPT-4 is objectively worse than the version released in March on a few tasks.

The team evaluated the models using a dataset of 500 problems where the models had to figure out whether a given integer was prime. In March, GPT-4 answered correctly 488 of these questions. In June, it only got 12 correct answers.

From 97.6% success rate down to 2.4%!
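
For reference, here is a minimal sketch of how such an eval could be scored (the answer parsing and the use of sympy are my assumptions, not the study's actual code):

# Hypothetical scoring sketch for the "is N prime?" eval; not the study's actual code.
from sympy import isprime  # ground-truth primality check

def score(model_answers):
    """model_answers maps each tested integer to the model's yes/no reply."""
    correct = 0
    for n, reply in model_answers.items():
        predicted_prime = reply.strip().lower().startswith("yes")
        if predicted_prime == isprime(n):
            correct += 1
    return correct / len(model_answers)

# e.g. 488/500 correct -> 0.976 in March, 12/500 correct -> 0.024 in June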

But it gets worse!

The team used Chain-of-Thought to help the model reason:

“Is 17077 a prime number? Think step by step.”

Chain-of-Thought is a popular technique that significantly improves answers. Unfortunately, the latest version of GPT-4 did not generate intermediate steps and instead answered incorrectly with a simple “No.”
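
For context, sending a chain-of-thought prompt like the one above through the chat API looks roughly like this (a sketch using the pre-1.0 openai Python library; the snapshot name is an assumption, and this is not the study's actual harness):

# Minimal chain-of-thought prompt sketch; not the study's evaluation code.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4-0613",  # assumed June snapshot; the study compared March vs June versions
    messages=[{"role": "user", "content": "Is 17077 a prime number? Think step by step."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])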

Code generation has also gotten worse.

The team built a dataset with 50 easy problems from LeetCode and measured how many GPT-4 answers ran without any changes.

The March version succeeded on 52% of the problems, but this dropped to a mere 10% using the model from June.
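
A rough sketch of what "ran without any changes" could mean in practice (the harness, entry point, and test format here are assumptions):

# Hypothetical check: does model-generated code run and pass the problem's tests as-is?
def runs_unchanged(generated_code, test_cases, entry_point="solve"):
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the model's function without any edits
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any error or wrong answer counts as a failure

# pass rate = number of problems where runs_unchanged(...) is True, divided by 50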

Why is this happening?

We assume that OpenAI pushes changes continuously, but we don’t know how the process works and how they evaluate whether the models are improving or regressing.

Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run. When a user asks a question, the system decides which model to send the query to.

Cheaper and faster, but could this new approach be the problem behind the degradation in quality?
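
Purely to illustrate the rumor (this is speculation, not OpenAI's actual architecture), such a router might conceptually look like this:

# Toy illustration of the rumored setup: route each query to a smaller specialized model.
# Entirely speculative; the model names and routing rules are invented for this example.
SPECIALISTS = {
    "code": "hypothetical-code-model",
    "math": "hypothetical-math-model",
    "general": "hypothetical-general-model",
}

def route(query):
    if "def " in query or "```" in query:
        return SPECIALISTS["code"]
    if any(ch.isdigit() for ch in query):
        return SPECIALISTS["math"]
    return SPECIALISTS["general"]

print(route("Is 17077 a prime number?"))  # -> hypothetical-math-model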

In my opinion, this is a red flag for anyone building applications that rely on GPT-4. Having the behavior of an LLM change over time is not acceptable.

5 Likes

Take a look at this eval from the team that brought Wizardcoder.
They are pretty much direct competition and found GPT4 performance improved since release.

Note: There are two HumanEval results of GPT4 and ChatGPT-3.5. The 67.0 and 48.1 are reported by the official GPT4 Report (2023/03/15) of OpenAI. The 82.0 and 72.5 are tested by ourselves with the latest API (2023/08/26).

1 Like

We all know that’s not true. I have used ChatGPT every day and I can tell it is being downgraded; every advanced user knows this at this point.

5 Likes

I have been posting to this thread myself because I had bad experiences in the past, but today my productivity is back to the expected levels.
The model did change; I had to rewrite my prompts and I did have to take a step back.
But ultimately the model is working “ok”. It’s the constant changes and hiccups that are really annoying.
So I cannot agree with the sentiment anymore.

The post by AI Explained indicates that this may not be the case.
On Youtube - SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam’s Many Errors

1 Like

There was a drastic drop in what it accepted from my browser UI re: length of input message at one point. It happened mid-use as the script I was passing in to update had actually just been shortened via manual improvements, and suddenly the response from the model was ~“the code was truncated, message too long, please retry with a shorter request”.

Maybe the model was just being kind to me before, accepting way too long of inputs, but now it says the character limit is roughly 4096 characters. I know I’ve been using far more than that, often, for months.
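
Worth noting that the model's limits are counted in tokens rather than characters, so if you want to check what you're actually sending, something like this works (assuming the tiktoken library):

# Rough check of how many tokens a long input actually uses; the context limit counts
# tokens, not characters.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = open("my_script.py").read()  # hypothetical long input pasted into the chat
print(len(text), "characters")
print(len(enc.encode(text)), "tokens")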

That being said - maybe it’s a case of OpenAI handling inputs beyond the stated limits when they can and enforcing them when they can’t? Mine also started after the model began erroring out often, coinciding with a ton of slow performance: regular responses like “Our systems have detected unusual activity from your system. Please try again later.” That was just before context went from awesome to truncating anything of real length, and that’s when the responses got “bad”.

I’m wondering if there’s automatic flagging at certain usage thresholds that just make your account lowest tier/last priority? Would explain why this thread seems to be full of power users.

I just canceled my plus subscription this morning - built/building my own UI to hit gpt-4-0314 for now, but comparing with -0613 to migrate once -0314 is unavailable. Obviously the usage isn’t supposed to be the same, but in limited tests 3.5 via API (solid context prompts) was outperforming gpt-4 in the browser, even with pretty refined (yet still almost full) custom instructions. Needless to say, my experience with 4 via API has been fantastic, because you can select models you know aren’t changing, and context, lengths, etc. are clear, defined, and consistent.

A lot of context, but to ask @vb: when you say you had to rewrite your prompts and take a step back, was that the only solution you needed? Did you take a break from consuming resources (hitting the models often)? Or was it just taking a breath and improving prompts? Curious if we’re having the same problems/reporting on the same poor performance.

Far more anecdotal than your examples, but I’ve had a few cases of testing a prompt through, i.e. 100 iterations with consistent output (given the variables populating the prompt), then running it as a cronjob for a few days, and by the end of that time the output seems to have deteriorated drastically.

For some applications I have a custom library that randomizes which model to hit: gpt-4, gpt-4-0613, or gpt-4-0314. (Long story, but every now and then cronjobs align and I have to avoid rate limiting / it’s all PoC anyway.) After the August 3 update, content generated by the two static models is still good 90%+ of the time, while from the new model I often get gibberish or weird, nonsensical results. Same temp, top_p, frequency/presence penalties, and of course prompt; the only thing different is the model…
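
Roughly, the rotation amounts to something like this (a simplified sketch of the idea, not the actual library code, using the pre-1.0 openai client):

# Simplified sketch: rotate between static GPT-4 snapshots to spread out cronjob load.
import random
import openai

MODELS = ["gpt-4", "gpt-4-0613", "gpt-4-0314"]

def ask(prompt):
    model = random.choice(MODELS)  # pick a model per call
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]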

Sucks, but that’s why, for anything that counts, I’m sticking to the static models for now

Yes, I did have, do have the same problems. I especially hate losing context mid-reply.

What I do is hit the thumbs-down and check if the second reply is back to normal, which it often is; if not, I double-check whether my prompting was lazy, which it often is in the cases that are left.

And yes, I remember very vividly the experience when the model fixed bugs in my code while developing another feature (“if we do this, then we should adapt this as well”). The good old black logo.

But ultimately, if I need something from the model I can get it most often.
And I also see cases where the model has actually improved. Those are just not the cases I care about when a conversation suddenly goes off the rails for no apparent reason.

One thing to consider is that there is a steady inflow of comments like “yesterday was fine but now it completely broke”, even though the first reports of these experiences came in months ago.
There is something, but I am sure it is not the huge model degradation that it is made out to be when reading a topic full of people venting about the recent issues.

Hope this helps somewhat.

1 Like

I asked OpenAI ChatGPT-4 about 3 issues with a GitLab CI/CD config today, and in just 5 prompts it made 3 grave logical errors. I gave it a yaml config file (87 lines) and pasted in a short 16 lines of job output.

E.g. the job output starts with the ls command and prints the result immediately:

$ ls build
mathpath

There are a few lines of output in between but then the output goes:

#6 [3/3] COPY build/mathpath /app/
#6 ERROR: failed to walk /var/lib/docker/tmp/buildkit-mount3440408583/build: lstat /var/lib/docker/tmp/buildkit-mount3440408583/build: no such file or directory

The bot correctly recognizes that a folder is not visible, but first it suggests that build/mathpath doesn’t exist, even though the ls output clearly shows the opposite.

Then it suggests that some artifacts may not be accessible because the dependencies for the stage are not specified. But I shared the yaml file with it 1-2 prompts earlier, so it knows that the dependencies for the stage have been specified correctly.

For another issue in the same short conversation, it says that some issues with repo cloning/copies may follow from an artifact path clash.

" The error fatal: destination path 'mathpath' already exists and is not an empty directory occurs because the directory mathpath already exists when the git clone command is invoked. The mathpath directory should be deleted or moved if it’s no longer needed, or the clone should be done into a new directory to avoid the conflict.

The issue could occur in a new job for a few reasons:

  1. Artifact Overlap: If you’re using GitLab’s CI/CD, the artifacts field in your .gitlab-ci.yml file keeps the build/mathpath directory. Ensure this doesn’t conflict with the git clone directory."

But it knows the whole yaml file, so it should see that the artifacts are not set up in a way that creates a clash. When I tell it so, it apologizes again like a poor 3.5 model:

" Why are you suggesting artifact overlap if I told you the whole yaml file?

ChatGPT

My apologies for the oversight. You’re right; the complete .gitlab-ci.yml file you provided does not indicate any artifact overlap that would lead to this issue. The artifacts field specifically saves build/mathpath, which shouldn’t conflict with the git clone operation targeting the mathpath directory. Thank you for bringing it to my attention."

Such obvious logical mistakes indicate a severely downgraded level of logical reasoning and/or memory issues. Things like that never happened to me before, since the time I switched to ChatGPT-4. Now it seems it is close to the 3.5 version.

What is also funny - I have just cancelled my renewal (which was also kinda tricky, as the page didn’t want to appear a few times - maybe a subscription server DOS attack?! :smiley: ). And immediately after that, when I tried to prompt it again, it told me I hit my 24h limit for messages! I wasn’t even aware I had such a limit! I know I have 50 messages per 4 hours, and I had started my day with these 5 prompts only. To me it looks like there is something malicious happening out there!

I’d guess Microsoft’s investment in ChatGPT has to pay off… look out for the new Bing etc. :slight_smile: I am out!

1 Like

Can you post the chat link to that discussion? The response wording you are getting indicates that the information you are requesting is past the context limit, i.e. you are asking about information that is now too far back in the history to be visible.

Sure, you can see it here: Docker Build Context Error

As you see, the discussion with the chat is very short. Some prompts are a little lengthy (one has 87 lines), but if this is a limitation, I have not run into it yet.

Do you have any custom instructions in your chatgpt settings?

(random extra words for post length)

Yes, but this is not a lot:

“I am a computer programmer with 10+ years of experience with PhD in physics. I want to develop a new business related to “retrofitting” old metalworking machines into CNC setup and about creating new machines, and robots - maybe guided by AI. I would like to create an open-source CNC/robotic setup for a garage or a little workshop for everyone, so that little companies working together can oppose the hegemony of metalworking giants.”

and

“When it seems like I know the topic a bit, and ask a specific question, do not provide a lengthy introduction into basics. Go straight to the point instead.”

I am still getting severely downgraded performance with ChatGPT Plus.

The bot is misquoting people, misstating philosophical views, confabulating historical facts, and keeps telling me to consult with more reliable sources.

Whatever problem they claimed was fixed has not been fixed, and it is also affecting paid users, unlike what is claimed in the blog post:

1 Like

Then what would you call what I have described?

That’s fine if you want to be pedantic about technical terms, but at least offer the correct one. That way I can move this to the right place.

The status update “ChatGPT severely degraded” is not about the quality of the output. It is because ChatGPT service worldwide was hanging, not responding to questions, not loading conversation history, giving errors that users were blocked or banned or rate limited. Severe system issues.

Not writing bad facts.

Then what is the correct term to refer to a severe downgrade in the quality of its output?

The correct thing is to check whether it is actually a downgrade, by comparing against higher-quality responses on the same type of interaction months ago, instead of just complaining about your particular example today.

And not linking an announcement about a service outage and referring to it like “they also agree it’s been severely downgraded”

3 Likes

Well, I know it has been downgraded for sure, as it refuses to perform tasks and obey prompts which it used to obey perfectly. As to your problem, I asked GPT about your problem and GPT says this (not that that means anything lol): "Here are some potential reasons for this experience:

1. Complex Coding Tasks: Coding tasks, especially in large projects with multiple files, can be intricate. The user mentions that they are working on substantial codebases, which might be challenging for the AI to understand fully and provide relevant responses.
2. Token Limitations: GPT models have token limitations, which affect both input and output. If a user’s prompt or code snippet is too long, it might be truncated or lead to incomplete responses. This can be frustrating when dealing with extensive code.
3. Difficulty in Context Retention: The user expresses concerns about the AI’s ability to remember previous instructions and maintain context across conversations. This could be a limitation of the model’s architecture, as it doesn’t have a persistent memory of past interactions.
4. Specificity of Coding Tasks: Coding tasks often require precise and detailed information. If the user’s prompts are not specific enough or lack essential details, it can lead to the AI producing irrelevant or incomplete code snippets.
5. Model Variability: GPT-3.5 might produce variable results depending on the specific input and prompt. It can sometimes generate lengthy outputs that contain repeated or unrelated sections, as mentioned by the user.
To improve the user’s experience with coding tasks, here are some suggestions:

1. Break Down Tasks: Instead of providing one lengthy prompt, break down coding tasks into smaller, more specific prompts to get focused responses.
2. Use Summaries: Summarize your instructions effectively within the token limit to convey the essential details.
3. Be Clear and Specific: Clearly specify the desired outcome and provide context for the AI to understand the task better.
4. Experiment: Experiment with different prompts and approaches to see what works best for your coding projects."

I asked GPT-4 to summarize a URL I made up, and it recapped the story of a new “human zoo” in China. So it is not a truth machine.

1 Like