GPT-4 has been severely downgraded (topic curation)

They will go to their grave with their measuring stick in hand (not a threat). I’m sure there’s some merit in his words, but the fact that they release no information about their newer models and ChatGPT functionality makes the comment come across as completely out-of-touch and tone-deaf.

Was it ever even confirmed that GPT-4 is actually using a Mixture of Experts? There was a really good article exploring this but I can’t find it anymore.

Here it is:

Makes one wonder why there are still no logprobs :thinking:

EDIT: oooh there was another article released that I didn’t read

1 Like



Here is an example of it suddenly forgetting what it was doing, and admitting it.

It was working okay up until just now; getting into 3 p.m. West Coast time and boom, it’s no longer performing with the memory it had been.

Then just now it seems to have completely forgotten the task it was working on and “made up” a new one that was completely off topic. It had nothing to do with anything I asked: changing the rate of audio playback. I had asked it to find the issue in the code, and it was making changes all over.



This error looks familiar; it happens during these amnesia moments.

So it has these “breaks of continuity and context” over and over again during some periods of the day, and less so at other times, usually following the peak hours of the US West Coast.

I have found the file upload feature is becoming useless, like a tiny hole of vision over the file compared to pasting into the prompt. It works better in code mode than in plugin mode, where things really seem wonky at times now, since I can’t paste in a file to help it. Paste and upload together are sometimes better, but other times they make it completely break too. Simply keeping context would save it so many resources, since I just keep hammering it until it finally gives me the money :P.

1 Like

It would be more helpful to everyone if you could just share a link to the chat.

One thing I’ve noticed with Code Interpreter is that occasionally the environment can be reset mid-chat—it’s rare but I can confirm I have observed it.

Often when that happens the model seems to go off the rails a bit: it knows it defined an object, but the object no longer exists. Suddenly it has an internal representation of the environment that is completely detached from reality. So it’ll try to run some code which it thinks should work, get an error about an undefined object, define that object, then get an error about a different undefined object, rinse and repeat.
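
To make that concrete, here is a minimal sketch of what that loop looks like from the sandbox side (the pandas DataFrame `df` is hypothetical, and the reset is simulated with `del`):

```python
import pandas as pd

# Step 1: earlier in the chat, the model defines an object in the sandbox.
df = pd.DataFrame({"value": [1, 2, 3]})

# Step 2: simulate the environment being reset mid-chat; all state is lost.
del df

# Step 3: the model's next code cell still assumes `df` exists.
try:
    print(df.describe())
except NameError as err:
    # This is the kind of error the model then tries to "fix" by redefining
    # df from memory, only to hit the next missing object, and so on.
    print(f"NameError: {err}")
```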

This can chew up a lot of context if the error messages are quite long, which can lead to instructions and the model’s internal representation getting pushed out of context.

From there things can rapidly cascade catastrophically out of control.

But, again, this is (to the best of my knowledge) very rare.

1 Like

Ah, I see this frequently in bursts, where it happens almost every other prompt. It comes and goes, yet any time I really go for a long stretch it starts to happen. I suspect that switching to a new context/session may clear it up for a while. It also may happen when I really jam it full of data.

1 Like

If you could isolate any commonalities in when it happens, you could submit it as a bug and maybe get a bounty.

It would also be a big help to everyone if OpenAI were able to ameliorate the problem.

2 Likes

I am sharing two chats (in Dutch) from a plugin I am developing. If you look only at the chat from August (today), the responses might seem okay. But look closely at the GPT response where my answer to an exam question is judged: July gave a brilliant and useful explanation of what I did right, the mistake I made, what was going on, and how to correct it. August also produced words, but those words are largely repetitions of what had already been said, and when it judged my answer it made errors in assessing what I did right and didn’t explain why the final answer didn’t match.

Note: the prompts the plugin uses have changed slightly in the meantime. I believe the perceived differences described below are due more to differences in the underlying gpt-4 model than to changes in the prompts; results vary from time to time and seem unrelated to the prompt changes.

July: (the first one, I think from somewhere around July 11)
https://chat.openai.com/share/98888035-982c-4476-8d27-eda85d9e0a67

I saved this one with !! in the name because I was so impressed with the results; I couldn’t think of any way to improve it. It was just perfect and I developed warm fuzzy feelings for this gpt :slight_smile:

August: (second one, today)
https://chat.openai.com/share/f85ab6c1-e582-42ee-b051-5f6b8fa9313b

The most notable difference in quality of the GPT response is the response to my answer/calculation (translated from Dutch): “I think it should go like this:
C5 = 500000 * (1 - (1+0.07)^-5)/0.07
= 2050098 euro
Then the initial investment still has to be subtracted: - 1700.000 = 350098 EUR
And then the residual value added: +300.000
Final answer NCW (NPV) = 650098 EUR”

My answer/calculation contains a single ‘bug’: I haven’t discounted the future value of 300.000 eur back to its value today (divide by 1.07^5).
Both versions fetch the ‘correctievoorschrift’ (gold answer and points) correctly.
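
For reference, a quick Python sketch of the two calculations, using the figures from the quoted answer (the corrected version reproduces the €2.263.994,57 that appears in the correctievoorschrift):

```python
# Figures from the quoted exam answer: 500,000/yr cash flow, 7% rate,
# 5 years, 1,700,000 initial investment, 300,000 residual value.
r, n = 0.07, 5
cashflow, investment, residual = 500_000, 1_700_000, 300_000

pv_cashflows = cashflow * (1 - (1 + r) ** -n) / r    # ~ 2,050,099 (the C5 above)

# My version: residual value added at face value (the 'bug').
npv_mine = pv_cashflows - investment + residual       # ~ 650,099

# Corrected: residual value discounted back 5 years first.
pv_inflows = pv_cashflows + residual / (1 + r) ** n   # ~ 2,263,995
npv_correct = pv_inflows - investment                 # ~ 563,995

print(round(npv_mine, 2), round(npv_correct, 2))
```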

July has an amazing response:

  • It correctly shows how the value of €2.263.994,57 is calculated (this exact calculation appears in the correctievoorschrift); especially the last term on the left-hand side is useful to show to the student, because it is different due to the 300.000 being added.
  • It very precisely tells me what I did right, but also what I did wrong, and why, and then proceeds to explain thoroughly what should have been done instead.

Aug response:

  • It reiterates my answer - but no additional info or insights - so no perceived added value.
  • It also has the tendency to create subsection headers - I am not fond of this personally.
  • Then continues with the gold answer, but with the end result only. The way the answer is calculated is NOT shown to the user.
  • It then compares on the end result only, and states that my answer does not correspond to the gold answer.
  • It then says that I added the 300.000 correctly (but I didn’t do it correctly, because I forgot to divide this value by 1.07^5) and then continues that my final answer is wrong because it doesn’t match the final answer from the ‘correctievoorschrift’.

In the remainder, another thing July did better:
My question (translated): “If the cash flows are received spread out evenly, can’t I just divide the total investment by the annual cash flow? I then get the number of years with some decimals, which I’d have to convert to months or something.”
July: provided a correct answer to this question
Aug: misinterpreted my question as an answer to question 17, fetched the gold answer, and gave that answer in the response. This way, the student would’ve been prevented from coming up with the answer him/herself.
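(To put a rough number on it, assuming the same figures as above apply to this question: the total investment of 1.700.000 divided by the yearly cash flow of 500.000 gives 3.4 years, i.e. roughly 3 years and 5 months.)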

The July version :rocket: and the Aug version :cry:

2 Likes

The post was mainly intended to respond in detail to the call for info in this tweet https://twitter.com/officiallogank/status/1695248712669483034?s=46

I can retry the chat a number of times (maybe also set the prompt back to the original one); how many times would be sufficient?

1 Like

I don’t know if you have experience developing a plugin, but what I would be interested in is a way to test a plugin chat 30 times without violating the ChatGPT policies (as far as I know, using Selenium or another browser-based ChatGPT wrapper is not allowed and could lead to a ban). Also, there is no API access to a plugin-enabled GPT-4 (and the API GPT-4 models could be different).
So practically, testing 30 chats systematically is impossible.

But also, I do not think 10-30 chats are required for the specific call for info from OA (the tweet I replied to with this post). As the plugin/prompt developer, I spend a lot of time repeating the same messages, and you get a feel for which instructions the model can follow and where its limits are. I don’t think the specific examples I shared are unrepresentative of the general feel of the responses at the time. The audience for this post is people at OA; maybe it’s useful, maybe not, but only they know what was changed and when.

2 Likes

Do we have any updates on this?

It feels like v4 has zero memory now.

What is going on?

2 Likes

Could this have anything to do with the suspected downgrade and lack of support for the regular folks?

If so, it’s a bit disappointing. But, on the other hand, if it keeps OpenAI out of bankruptcy and still the darling of M$, then it is actually a good thing for us all.

1 Like

I use a brief prompt at the beginning of conversations that I expect to be on the longer side.

“Please make your responses brief and refrain from describing your limitations.”

It’s been very good at trimming non-essential, “conversational” fluff and doesn’t waste context on the paragraph you gave above.

1 Like

GPT-4 is getting worse over time, not better.

Many people have reported noticing a significant degradation in the quality of the model’s responses, but so far it has all been anecdotal.

But now we know.

At least one study shows how the June version of GPT-4 is objectively worse than the version released in March on a few tasks.

The team evaluated the models using a dataset of 500 problems where the models had to figure out whether a given integer was prime. In March, GPT-4 answered correctly 488 of these questions. In June, it only got 12 correct answers.

From 97.6% success rate down to 2.4%!

But it gets worse!

The team used Chain-of-Thought to help the model reason:

“Is 17077 a prime number? Think step by step.”

Chain-of-Thought is a popular technique that significantly improves answers. Unfortunately, the latest version of GPT-4 did not generate intermediate steps and instead answered incorrectly with a simple “No.”
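
For anyone who wants to spot-check this kind of result themselves, here is a rough sketch of such an eval against the dated API snapshots; the study’s exact prompt, dataset, and grading almost certainly differ:

```python
# Rough sketch of a primality eval with a Chain-of-Thought prompt.
# Assumes the openai Python package (v1 client) and access to the dated
# GPT-4 snapshots; the prompt wording and grading are simplified.
from openai import OpenAI
from sympy import isprime

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, n: int) -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Is {n} a prime number? Think step by step, "
                       f"then end with 'Answer: yes' or 'Answer: no'.",
        }],
    )
    return resp.choices[0].message.content

def is_correct(answer: str, n: int) -> bool:
    return ("answer: yes" in answer.lower()) == isprime(n)

# Tiny illustrative sample (two primes and a composite), not the study's 500.
numbers = [17077, 10007, 10001]
for model in ("gpt-4-0314", "gpt-4-0613"):
    score = sum(is_correct(ask(model, n), n) for n in numbers)
    print(model, f"{score}/{len(numbers)} correct")
```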

Code generation has also gotten worse.

The team built a dataset with 50 easy problems from LeetCode and measured how many GPT-4 answers ran without any changes.

The March version succeeded on 52% of the problems, but this dropped to a meager 10% with the model from June.

Why is this happening?

We assume that OpenAI pushes changes continuously, but we don’t know how the process works and how they evaluate whether the models are improving or regressing.

Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run. When a user asks a question, the system decides which model to send the query to.
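
Purely as a toy illustration of that routing idea (the experts and routing rule below are made up; nothing here reflects how OpenAI actually serves GPT-4):

```python
# Toy sketch of "route each query to a smaller, specialized model".
# Entirely hypothetical; for illustration of the rumor only.
from typing import Callable, Dict

# Placeholder "experts" standing in for smaller specialized models.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "code": lambda q: f"[code expert] {q}",
    "math": lambda q: f"[math expert] {q}",
    "general": lambda q: f"[general expert] {q}",
}

def route(query: str) -> str:
    """Crude keyword-based router that picks which expert answers the query."""
    q = query.lower()
    if any(k in q for k in ("bug", "error", "function", "compile")):
        return EXPERTS["code"](query)
    if any(k in q for k in ("prime", "integral", "equation", "probability")):
        return EXPERTS["math"](query)
    return EXPERTS["general"](query)

print(route("Is 17077 a prime number?"))  # handled by the "math" expert
```

If something like this were in play, a regression in the router or in a single expert would look, from the outside, like the model getting worse on some tasks while others stay fine.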

Cheaper and faster, but could this new approach be the problem behind the degradation in quality?

In my opinion, this is a red flag for anyone building applications that rely on GPT-4. Having the behavior of an LLM change over time is not acceptable.

5 Likes

Take a look at this eval from the team that brought us WizardCoder.
They are pretty much direct competition, and they found that GPT-4 performance has improved since release.

Note: There are two HumanEval results of GPT4 and ChatGPT-3.5. The 67.0 and 48.1 are reported by the official GPT4 Report (2023/03/15) of OpenAI. The 82.0 and 72.5 are tested by ourselves with the latest API (2023/08/26).

1 Like

We all know that’s not true. I have used ChatGPT every day and I can tell it is being downgraded; every advanced user knows this at this point.

5 Likes

I have been posting to this thread myself because I had bad experiences in the past, but today my productivity is back to the expected levels.
The model did change; I had to rewrite my prompts and I did have to take a step back.
But ultimately the model is working “ok”. It’s the constant changes and hiccups that are really annoying.
So I cannot agree with the sentiment anymore.

The post by AI Explained indicates that this may not be the case.
On YouTube: SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam’s Many Errors

1 Like

There was a drastic drop in the input message length the browser UI accepted from me at one point. It happened mid-use: the script I was passing in to be updated had actually just been shortened via manual improvements, and suddenly the response from the model was roughly “the code was truncated, message too long, please retry with a shorter request”.

Maybe the model was just being kind to me before, accepting inputs that were way too long, but now it says the character limit is roughly 4096 characters. I know I’ve often been using far more than that for months.

That being said, maybe it’s a case of OpenAI handling inputs beyond the stated limits when they can and enforcing them when they can’t? Mine also started after frequent errors, coinciding with a lot of slow performance and regular responses like “Our systems have detected unusual activity from your system. Please try again later.” That was just before context went from awesome to truncating anything of real length, and that’s when the responses got “bad”.

I’m wondering if there’s automatic flagging at certain usage thresholds that just makes your account lowest tier / last priority? That would explain why this thread seems to be full of power users.

I just canceled my Plus subscription this morning. I’ve built / am building my own UI to hit gpt-4-0314 for now, while comparing with -0613 so I can migrate once -0314 is unavailable. Obviously the usage isn’t supposed to be the same, but in limited tests 3.5 via the API (with solid context prompts) was outperforming GPT-4 in the browser, even with pretty refined (yet still almost full) custom instructions. Needless to say, my experience with 4 via the API has been fantastic (because you can select models you know aren’t changing, and context, lengths, etc. are clear, defined, and consistent).

That’s a lot of context to ask @vb: when you say you had to rewrite your prompts and take a step back, was that the only solution you needed? Did you take a break from consuming resources (hitting the models often), or was it just taking a breath and improving prompts? Curious whether we’re having the same problems / reporting on the same poor performance.

Far more anecdotal than your examples, but I’ve had a few cases of testing a prompt through, say, 100 iterations with consistent output (given the variables populating the prompt), then running it as a cron job for a few days, and by the end of that time the output seems to have deteriorated drastically.

For some applications I have a custom library that randomizes which model to hit: gpt-4, gpt-4-0613, or gpt-4-0314 (long story, but every now and then cron jobs align and I have to avoid rate limiting; it’s all PoC anyway). After the August 3 update, content generated by the two static models is still good 90%+ of the time, while from the new model I often get gibberish or weird, nonsensical results. Same temp, top_p, frequency/presence penalties, and of course the same prompt; the only thing different is the model…
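
Something along these lines, heavily simplified (the model names are the dated snapshots mentioned above; everything else is illustrative):

```python
# Simplified sketch of the "randomize which GPT-4 snapshot to hit" idea
# described above; assumes the openai v1 Python client.
import random
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4", "gpt-4-0613", "gpt-4-0314"]  # floating alias + pinned snapshots

def generate(prompt: str) -> tuple[str, str]:
    """Pick a model at random (spreads out rate limits) and return (model, text)."""
    model = random.choice(MODELS)
    resp = client.chat.completions.create(
        model=model,
        temperature=0.7,
        messages=[{"role": "user", "content": prompt}],
    )
    return model, resp.choices[0].message.content

model, text = generate("Summarize today's report in three bullet points.")
print(model, text[:200])
```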

Sucks, but that’s why, for anything that counts, I’m sticking to the static models for now

Yes, I did have, and do have, the same problems. I especially hate losing context mid-reply.

What I do is hit the thumbs-down and check whether the second reply is back to normal, which it often is; if not, I double-check whether my prompting was lazy, which it often is in the cases that are left.

And yes, I remember very vividly the experience of the model fixing bugs in my code while developing another feature (“if we do this then we should adapt this as well”). The good old black logo.

But ultimately, if I need something from the model, I can get it most of the time.
And I also see cases where the model has actually improved. Those are just not the cases I care about when a conversation suddenly goes off the rails for no apparent reason.

One thing to consider is that there is a steady inflow of comments like “yesterday was fine but now it completely broke”, even though the first reports of these experiences came in months ago.
There is something going on, but I am sure it is not the huge model degradation it is made out to be when reading a topic full of people venting about the recent issues.

Hope this helps somewhat.

1 Like

I asked OpenAI ChatGPT-4 about 3 issues with a GitLab CI/CD config today, and in just 5 prompts it made 3 grave logical errors. I gave it a YAML config file (87 lines) and pasted in a short 16 lines of job output.

E.g., the job output starts with the ls command and prints its output immediately:

$ ls build
mathpath

There are a few lines of output in between but then the output goes:

#6 [3/3] COPY build/mathpath /app/
#6 ERROR: failed to walk /var/lib/docker/tmp/buildkit-mount3440408583/build: lstat /var/lib/docker/tmp/buildkit-mount3440408583/build: no such file or directory

The bot correctly recognizes that a folder is not visible, but it first suggests that build/mathpath doesn’t exist, even though the ls command clearly shows the opposite.

Then it suggests that some artifacts may not be accessible because the dependencies for the stage are not specified. But I shared the YAML file with it 1-2 prompts earlier, so it knows that the dependencies for the stage have been specified correctly.

For another issue in the same short conversation, it says that problems with repo cloning/copies may follow from an artifact path clash:

" The error fatal: destination path 'mathpath' already exists and is not an empty directory occurs because the directory mathpath already exists when the git clone command is invoked. The mathpath directory should be deleted or moved if it’s no longer needed, or the clone should be done into a new directory to avoid the conflict.

The issue could occur in a new job for a few reasons:

  1. Artifact Overlap: If you’re using GitLab’s CI/CD, the artifacts field in your .gitlab-ci.yml file keeps the build/mathpath directory. Ensure this doesn’t conflict with the git clone directory."

But it knows the whole YAML file, so it should see that the artifacts are not set up in a way that creates a clash. When I tell it so, it apologizes again like a poor 3.5 model:

" Why are you suggesting artifact overlap if I told you the whole yaml file?

ChatGPT

My apologies for the oversight. You’re right; the complete .gitlab-ci.yml file you provided does not indicate any artifact overlap that would lead to this issue. The artifacts field specifically saves build/mathpath, which shouldn’t conflict with the git clone operation targeting the mathpath directory. Thank you for bringing it to my attention."

Such obvious logical mistakes indicate a severely downgraded level of logical reasoning and/or memory issues. Things like that had never happened to me since I switched to ChatGPT-4. Now it seems it is close to the 3.5 version.

What is also funny: I just cancelled my renewal (which was also kinda tricky, as the page didn’t want to load a few times; maybe a subscription server DoS attack?! :smiley: ). And immediately after that, when I tried to prompt it again, it told me I had hit my 24h limit for messages! I wasn’t even aware I had such a limit! I know I have 50/4h, and I had started my day with only these 5 prompts. To me it looks like something malicious is happening out there!

I’d guess Microsoft’s investment in ChatGPT has to pay off… look out for the new Bing etc. :slight_smile: I am out!

1 Like