Experiencing Decreased Performance with ChatGPT-4

Has OpenAI provided an explanation for the very clear degradation of ChatGPT’s performance with the GPT-4 model? Since the last two updates, it is significantly faster (which is fundamentally useless when we are capped at 25 messages every 3 hours…), suggesting that computational power has been reduced, while at the same time it is making unprecedented reasoning errors. As a regular user, I have noticed the emergence of gross errors that did not exist before, and especially a much greater tendency to forget. It loses track of certain instructions rather quickly (10-15 messages), whereas previously it took more (25-30). Has the number of tokens been reduced? Honestly, presenting a nice tool to attract customers and then discreetly reducing its capacity is not very transparent, or even…

For example, I ask ChatGPT to translate my messages (English is not my native language). Previously, an explanatory prompt was enough to get all my messages in a session translated. Now, after a few messages, it responds to me in English (when I am only asking it to translate) and confuses what it needs to translate with content it thinks is addressed to it. It has become even more limited than the old GPT-3!

63 Likes

Do you have examples of the same prompt returning drastically different quality results?

1 Like

Unfortunately not, because I don’t keep my history for long. But here are a few concrete examples:

  • It makes errors in using informal/formal language in my native language, which it simply never used to do (just… never)

  • Today, during a botanical discussion, it confused a tree with a pond… it may sound laughable, but it’s one of those many minor situations that have been happening often in the past few days

  • I regularly prompt it to engage in text-based RPGs… now it loses track of the conversation after about 15 messages, whereas before it would forget instructions only after around thirty. It begins with subtle mistakes (forgetting details) and then much bigger ones (reasoning errors!)

  • I have saved old conversation exchanges where I compared GPT-3.5 to GPT-4 (in March) on a professional topic (quality management)… I have re-run the same conversations today, and it is clear that the responses from GPT-4 are less ‘intelligent’ than they were three months ago (even though they are better presented textually… but I don’t care, I’m not looking for a copywriter). The reasoning is less deep and less subtle with GPT-4. It doesn’t seem to be an improvement over GPT-3.5 in that regard.

I consider myself a highly advanced user of GPT-4, with hundreds of hours of usage, and I push the tool to its limits… and it’s clear that GPT-4 has been quickly struggling with rather basic things since the last update.

I’m sorry, I’m not writing this to be unpleasant. I love this tool, but I feel like the shell has been emptied out.

20 Likes

Another very surprising thing: I regularly use a prompt to have my messages translated into English. I use GPT-4 when I’m below the limit, or GPT-3.5 otherwise. Usually, both understand the logic of the prompt well and confirm in my native language that they have understood my instruction and are waiting for the messages to translate. Well, today (~19:00 CET), GPT-4 responded to me in English (which is a bit silly for someone asking it to translate… and therefore not supposed to understand it) and was confusing the text to translate with instructions. At the same time, GPT-3.5 understood correctly. Now, with exactly the same prompt, GPT-4 (~21:30 CET) understands correctly again… for now.

1 Like

Can you share any of these comparisons? I’d be interested to see the differences.

2 Likes

Agreed. This is exactly what I’m asking for.

People keep saying the responses are worse, but never give any examples.

3 Likes

Another one: in March, GPT-4 was able to answer this question, and GPT-3 got confused. Now, both fail: “How can you time 9 minutes using a 4-minute hourglass and a 7-minute hourglass?”


"User
How can you time 9 minutes using a 4-minute hourglass and a 7-minute hourglass?

ChatGPT
You can time 9 minutes using a 4-minute hourglass and a 7-minute hourglass as follows:

Start both hourglasses at the same time.

After 4 minutes, the 4-minute hourglass will run out. Immediately flip the 4-minute hourglass to restart it.

After 3 more minutes (7 minutes in total), the 7-minute hourglass will run out. At this point, don’t do anything yet.

After 1 more minute (8 minutes in total), the 4-minute hourglass will run out (for the second time). Immediately flip the 4-minute hourglass again.

After 1 more minute (9 minutes in total), the sand in the 4-minute hourglass will have been running for 1 minute since you flipped it last. At this point, you’ve measured 9 minutes.

User
How do you wait for the minute in step 5?

ChatGPT
In this strategy, we have no precise way to measure one minute at step 5. However, that minute is deduced from the overall process. Here is how it works:

Start both hourglasses at the same time.

After 4 minutes, the 4-minute hourglass will run out. Immediately flip the 4-minute hourglass.

After 3 more minutes (7 minutes in total), the 7-minute hourglass will run out. At this point, do nothing.

After 1 more minute (8 minutes in total), the 4-minute hourglass will run out (for the second time). Immediately flip the 4-minute hourglass again.

After 1 more minute (9 minutes in total), the sand in the 4-minute hourglass will have been running for 1 minute since you last flipped it. That is when you have measured 9 minutes. You don't wait for that minute specifically; it is the result of the preceding steps.

User
Not convincing

ChatGPT
I apologize if my previous answer was not satisfactory. I understand that you are looking for a method to measure an exact duration of 9 minutes using a 4-minute hourglass and a 7-minute hourglass, where each minute is precisely counted. Unfortunately, that is not possible with only these two hourglasses, because they cannot measure 1-minute intervals.

If you have other questions or another scenario to suggest, I will be happy to help you find a solution.

User
At step 3, you flip the 7-minute hourglass. When you reach step 4, 8 minutes have passed, and only 1 minute of the 7-minute hourglass has elapsed. You flip it, and it will be empty in 1 minute, so at 8+1.

ChatGPT
Ah, now I understand. Thank you for the clarification. Your strategy is entirely correct. Here is the revised method:

Start both hourglasses at the same time.

After 4 minutes, the 4-minute hourglass will run out. Immediately flip the 4-minute hourglass.

After 3 more minutes (7 minutes in total), the 7-minute hourglass will run out. Immediately flip the 7-minute hourglass.

After 1 more minute (8 minutes in total), the 4-minute hourglass will run out (for the second time). Do not flip it, but note that the 7-minute hourglass has been running for 1 minute since you flipped it.

After 1 more minute (9 minutes in total), the sand in the 7-minute hourglass will have run out, thus measuring exactly 9 minutes.

Your strategy is excellent, and I apologize for the initial confusion.

User
The flip of the 7-minute hourglass is missing at step 4.

ChatGPT
I apologize for the omission. You are absolutely right. Here is the corrected method:

Start both hourglasses at the same time.

After 4 minutes, the 4-minute hourglass will run out. Immediately flip the 4-minute hourglass.

After 3 more minutes (7 minutes in total), the 7-minute hourglass will run out. Immediately flip the 7-minute hourglass.

After 1 more minute (8 minutes in total), the 4-minute hourglass will run out (for the second time). At this point, the 7-minute hourglass has been running for 1 minute since you flipped it. Now flip the 7-minute hourglass.

After 1 more minute (9 minutes in total), the 7-minute hourglass will run out, giving you a total of 9 minutes.

Once again, thank you for your patience and your insight."


This is a copy-paste from a GPT-4 exchange. (GPT-3.5 made the same type of error at step 5.) You can see the reasoning difficulties and the struggle to correct itself…
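For reference, here is a quick sanity check of the intended timeline in Python (my own sketch, assuming ideal hourglasses and tracking only the whole-minute event times):

```python
# Quick sanity check of the 9-minute method: track only the event times.
#
# t = 0: start both hourglasses
# t = 4: the 4-minute glass empties (its later state doesn't matter here)
# t = 7: the 7-minute glass empties -> flip it
# t = 8: the 4-minute glass empties again; the 7-minute glass has run for
#        1 minute since its flip, so flipping it now leaves 1 minute of sand
# t = 9: the 7-minute glass empties -> 9 minutes measured

first_flip_of_seven = 7                                            # minute of the first flip
second_flip_of_seven = 8                                           # flipped again when the 4 empties
sand_run_since_flip = second_flip_of_seven - first_flip_of_seven   # 1 minute of sand has flowed
finish = second_flip_of_seven + sand_run_since_flip                # that minute flows back: 8 + 1 = 9

assert finish == 9
print(f"Total time measured: {finish} minutes")
```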

Fun fact, 30 minutes later, both models gave a correct answer in new prompts (learning?).

… and this morning, once again, they made reasoning mistakes with incoherent explanations. This time it was almost impossible to lead GPT-4 to a correct explanation!

In March, GPT-4 answered correctly on the first try, with the right explanations. This question was part of a series of 30 questions I tried (28 out of 30 were accurate!).

Today it feels more like GPT-3.6 than GPT-4.

6 Likes

Do you happen to have a transcript of the response from March?

Unfortunately, not anymore. It was shorter, but there was no mistake with the 7-minute trick (flipping it after only one minute). This morning it was OK, but now it’s even worse, completely lost.

1 Like

I guess what I’m wondering is whether this is a case of degradation over time, or whether it’s just the variability of the model. I was also able to get a correct working result sometimes with the hourglass example you sent, but not other times (you can “regenerate response” a few times and see what the different answers look like). I suspect it’s not a case of the model being better a few hours ago, and worse now, and better again in the future, etc., but instead a case of model variation. Perhaps if you were using the API with temperature 0 you’d see less variation between attempts.
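As a rough sketch of what that would look like with the openai Python package as it existed at the time (the API key is a placeholder, and the prompt is just the hourglass example from above):

```python
import openai  # pip install openai (pre-1.0 interface)

openai.api_key = "YOUR_API_KEY"  # placeholder

# Ask the hourglass riddle with temperature 0 to reduce run-to-run variation.
# Note: temperature 0 makes sampling close to deterministic, not strictly identical.
response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": "How can you time 9 minutes using a 4-minute hourglass "
                       "and a 7-minute hourglass?",
        }
    ],
)

print(response.choices[0].message["content"])
```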

This is a really interesting prompt for looking at variability of responses: I’m getting very, very different results across “regenerate response” attempts (most of them either confusing or wrong).

1 Like

Just share the chats. That’s a thing now and this is a perfect use case for it.

4 Likes

For me, the decreased performance has been noticeable since day one after Plugins went into beta. There is something wrong with that model :slight_smile:

Below is a sample conversation to generate some Python code:

It isn’t even able to generate proper code (which it is supposed to excel at) with the right instructions after a few ToT-like prompts; basically, it didn’t follow the resolver plans or even come up with a better plan.

You can test many use cases where the GPT-4 plugin model is incapable most of the time. I find the reasoning and instruction following of the GPT-4 model without plugins to be better than the one currently used in Plugins.
I don’t know why, though. :slight_smile:

And I really miss the 3.5 model with plugins; it was really that good for many of the tasks thrown at it. Take a look at the same prompt sequence above using the 3.5 model, not GPT-4 without plugins (which I think would fare better); I hope you get what I mean about something specific to the plugin model after it was made beta.

You can even tell that 3.5 was also smarter: at steps 10 and 11 it already inferred that it was not applicable.

I have tried all possible prompt techniques with the GPT-4 plugin model: CoT, ToT, AutoGPT-style, pseudo-code prompts. It fails miserably at following the instructions fully, and sometimes even at coming up with proper, cohesive plans.

[Edit]
Here is the GPT-4 model’s response without plugins; the 3.5 response was still better.

I hope that makes it clear, but you can test many tasks in the same ballpark :slight_smile:

2 Likes

I don’t keep 20+ message conversations in my history for 3 months. If you do, just try it yourself: re-prompt and see if the quality is the same. I’m talking about challenging prompts, not just recipes based on your fridge’s contents.

Another example: the answer is so dumb… even 3.5 does not get fooled.

Again, I have no previous history available, but the March model was able to produce realistic answers and not just “as an AI…”

Then again, the speed is way higher; I would like to know what compromises were made to achieve this. It’s also strange that the new models’ release notes do not mention the speed increase. Why?

3 Likes

Right, and sometimes it gives the correct answer with a bad explanation. It seems to help when you ask it to make a minute-by-minute table. That was not necessary in March.

There’s another interesting case: the quality of the GPT-4 answer is about the same as before, but GPT-3.5’s has improved (the March version was unable to simulate the right expectation):

GPT-4

GPT-3.5

I have been experiencing the same. It’s very frustrating, especially after having had the chance to work with GPT-4 and get excellent results. Now the answers lack context, are significantly shorter, and even the grammar and the way of replying seem less human-like and more bot-like. I’m paying for the Plus version and it’s still not working as it used to. Furthermore, the message cap is ridiculous when you now need to explain to ChatGPT 3 or 4 times what needs to be done, so those 25 messages run out very quickly because of ChatGPT’s own lack of understanding of the question or its inability to stick to what you tell it to do.

8 Likes

Agreed - the degradation in intelligence has been VERY noticeable compared to what it used to be. I first noticed it almost a month ago; more recently it’s much worse. At the time, others told me they think it varies depending on load, but I don’t know.

And with the 25-message cap, having to repeat questions so often as it keeps making mistakes really uses up the cap quickly. I often find myself using GPT-3 70% of the time, wondering what I’m paying for.

7 Likes

Hi @logankilpatrick
I appreciate you and the team looking at this.
It’s really pretty obvious if you use both models, and some samples are shared in the links above.

Thank you!

1 Like

GPT-4 is in good shape today. I combined a few logic questions to test its capabilities; GPT-4 got 11/11 correct, while GPT-3.5 got 5/11 correct. I will try to keep these records for a small long-term follow-up :wink:

1 Like

Not for me, unfortunately :frowning:
It really makes the simplest mistakes and doesn’t follow the rules that are set, or you have to spend prompt after prompt to get it to do what you want.
The same sample chat I used before is still exactly applicable, and 3.5 looks better to me; when I say better, I mean in terms of following instructions and following through on what it says.

The GPT-4 model says lots of things but basically doesn’t follow my instructions fully, or the plan that it itself laid out.

I literally just ran out of capacity using the same solar system orbit example; a complete waste for me at this point.

It works well for simple tasks, but once you want it to do something meaningful I don’t think it is able to follow through at all.

I couldn’t get it to improve using any prompt technique :frowning: I also don’t think they did anything to the model, since we are still on that May 24th version, unless model updates are not reflected in the web interface.

But net net, it’s good for hobby use and poor at real tasks. I’m not talking about AutoGPT stuff; I’m talking about a well-thought-out conversation between me and the AI where I try to steer it to do what I want.
My thinking is really to step away from this thing and wait until it matures or a different model is released, but with all the hype being made, I have basically come crashing back to earth now, unfortunately :frowning:

4 Likes

I agree. I noticed a stark difference today as well, enough to create an account and reply to you. We talked about it on the Discord server using the apple test; even if you don’t know what that is, what matters is that we saw a stark difference: we now get 6/10 consistently instead of the 10/10 it started with. The API is apparently the only workaround; it is just infuriating that they don’t announce this. We need to be more vocal. So much for being OpenAI: they don’t announce worsening it for the Plus users.

6 Likes