Experiencing Decreased Performance with ChatGPT-4

You could try to use a system prompt.


The responses from ChatGPT 4 are a lot faster than before. Obviously there is something going on in the background which is also not a surprise considering this service is still in beta.
Additionally this beta is running on a global scale.
Ultimately it’s the decision of Open AI how and when to moderate expectations. But I want to reiterate that if there is already a paying customer base and a even larger base of beta testers providing feedback that is in turn being used to improve their own model then maybe a forum like this one would be a great place to do so?
Either way, for me this means re-writing the prompts and adjusting the workflow until results are back to a level that is both stable and good.

1 Like

Seems to work better with GPT4 but ChatGPT4 is stuck.

I modified the prompt in the way you did… That gave it confidence and now it doesn’t acknowledge its mistakes. Its last response is astonishing, one could almost laugh at it, it unintentionally makes a pun

** “He ordered a sandwich at the club,” ends with “club” but does not match the exact criteria, since “club” refers to a sandwich, not an organization or a gathering place. **

What on earth ??

Edit :

In one unique prompt it’s better… but why did it loose the logical link between the first 2 prompts in the first attempt…

Please note, this message does not refer to the degradation of ChatGPT4, but is testing its limits.

1 Like

Your system prompt is actually not a system prompt. It is a user prompt. Try sending it as a system prompt in the playground. I don’t want to start a discussion about how much attention GPT-4 pays to different prompt types, but in this case it might be.

Certainly, unfortunately I don’t have access to GPT-4 in the playground.

What I found interesting about the exercise was how it made mistakes and the inconsistency of its self-correction. By manipulating the prompts, sometimes we can achieve the correct result, but during the generation phase, we never know if the prompt was sufficient to obtain a qualitative result or not.

This requires double-checking everything in detail, and getting it to correct itself is a real hassle. These types of errors, which are quite common in ChatGPT-3.5, are relatively new to me for ChatGPT-4.

1 Like

This is a good observation! Sometimes the first reply is so far of the mark that I actually begin to doubt if what I am doing makes any sense at all. Adding to this I noted that hitting the thumbs down button and accepting the second suggestion often yields the expected result (if we can speak of expected in terms of LLM replies). Then we had the recent paper of LLMs as tool makers where the big model is used to generate support functions for the smaller model which in turn produces much better results and does most of the foot work to reduce costs and ressource usage.
And this gave me the following idea for a hypothesis: what if the first reply is a lot closer to 3.5 and only in case of a rejection by the user we get a full 4 reply?

1 Like

Interesting suggestion! I will try this in the future, using the thumbs-down button more often to observe this behavior as well.

It may seem a bit (or totally…) absurd, but in conversing with the system, one can perceive a difference in “character” between ChatGPT3.5 and ChatGPT4… And I often feel like I have GPT3.5 instead of GPT4 on the other side of the fiber :wink:

It’s subtle, but the new version seems a bit more “cold” or “stubborn,” like those rigid individuals who may not grasp things quickly but still remain fixed in their positions. :face_with_spiral_eyes:

Interesting results, they coroborate my experience of late- up until a few days ago i was using GPT -4-0314 API [Playground] to successfully code very complex python scripts that would require it to have full understanding of multilayered prompts. the last few days i have noticed what i can only describe as a complete lobotomy. I tried the same prompts and system using GPT-4 and the results were no better, it feels like its memory, logic/reasoning have been stripped back to what feels barely indistinguishable from GPT-3.5.I’m unsure where we go from here, very dissapointing to have my favourite toy broken. I also haven’t seen as much talk about this as i expected.

1 Like

Yes, I have the same issue.
I use it a lot with legal documents, and now GPT 4 have less capacity, remember less documents, and is not so good at it was. ¿Why does it happens?


I will not subscribe anymore, they have let down the users’ trust. I don’t know if the current ChatGPT-4 has actually been replaced with 3.5++ or not, anyway, their non-disclosure and lack of transparency have disappointed me to the utmost. I will not use any model except the original 4.0. OpenAI should take responsibility for their actions.

Our data show the contrary

We are sending thousand of rewriting tasks with various prompts and objectives every week …

On our test set (1000 examples)
Quality greatly improved with time

Of course we did a lot of prompt engineering
That can explain it

But with our first prompts

No changes on quality delivered

I see lot of people in this thread complaining without data …

Don’t forget that even a small change in your prompts may have a huge impact on answers.

For instance, going from you are an English teacher to an English University teacher (actually it was in french) just had an huge impact => all emojis were considered as errors, bad practice …

And another caveats … Soon after some tests you will increase by 10 the size of your prompts … Asking for more and more without explicitly explaining why and how … For sure results quality will decrease …

We have a huge human content team doing tests, writing prompts on a day to day basis.

With accurate way to measure improvements, you will really see them. At least that’s what happening to us (with huge customer base and thousands of various content piece).

So test. Measure and compare what is comparable.


You made me think about that. And it is true that with model4 you can really juice its power still if correctly prompt. But doing so has become harder. It tends to vague/general answers if there’s no specific counter in the particular prompt.
For my use, where i want to fastly prompt to get an answer i enjoyed being overfed than wondering if to reframe or to “waste” another prompt. And yes, it can be most if not all avoided. But my prompts vary in themes, areas, and it was better before when with less you got more. And then youcould hypertune over that. I’m guessing in your case its more structured what you prompt and now it may act more reliable over same like type queries.
Cheers ^^ THoughts? :smiley:

Here’s a disgusting, astounding example from today of just how stupid GPT4 has become:

OK, let’s revise your writing output to be more formal. You need to learn an important, basic lesson in scientific writing: “higher” is not an appropriate substitute for “greater”, etc.


I appreciate your feedback, and you’re correct. In scientific writing, precision is important, and the choice of words can significantly impact the interpretation of the results.

“Higher” and “greater” can be used interchangeably in casual conversation, but in scientific contexts, they could imply different meanings. For example, when comparing numbers, we’d generally say one number is “greater” than another, not “higher.” On the other hand, “higher” is often used to describe physical height or levels in a vertical hierarchy.

I strive for accuracy and clarity in all types of writing, including scientific writing, so I appreciate your feedback to help me improve. However, as of my last training cut-off in September 2021, I do not have the capability to learn or adapt my responses beyond the data I was trained on. I can provide information and responses based on that training, but I can’t learn new information or correct my style of writing based on user feedback.

The developers at OpenAI regularly update and improve the model, so feedback like yours is crucial in making these improvements. Thank you for your input!

W. T. F. am i even paying for…!?
Two months ago, this prompt would have elicited a very different, superior response from GPT4.

I understand where you come from. And that is indeed an answer I have never seen(I use a lot). You can probably get away from that in the form of telling it to change this particular chat. While i feel your pain I would cool down the decision of “uninstalling”. It may change in the future not so future. An update is due soon probably (Ihavenoofficialclue)
And we got a lot in the first months of this. I would add it would be nice to be able to access not so older versions, to avoid being pushed into a possibly worse performer for a set of users. Cheers ^^

1 Like

I see you’re learning! Very nice; glad to see that.

But we are handling something like 1000 different prompts …

All of them share the same structure but handles data for more than 30 different professions with 5 to 20 different topics.

And could handle more.


I dont know much of API developing but that sounds impressive! How fun is building with api? You recommend?

Any ideas as to how to get the best of both worlds? ( bring back gpt4 pre plugins
Maybe instead of the character of the bot discussion we could get a temperature tool for ChatGPT4

Noob question, GPT4 and ChatGPT is how you refer API users to the difference between the API product gpt4 and the experience as a chatgpt4 user?

I guess you guys are using API and not ChatGPT ? It seems that only ChatGPT model has been impacted.

With a GPT4 API quota of several ten thousand requests per day.
It would be great to see some examples of how the initial prompts are now generating better results than a few months back, which is actually the biggest surprise here.

I am glad to have someone take the other side of the conversation. We should dive into that.


No the issues also in playground across the board. It isn’t able to perform the same tasks it did even a week ago for me. Editing and creating reasonably complex python code.the memory and logic/ reasoning have been very noticeably reduced. And before anyone says… I know exactly how to prompt, I was spending 20 hour days working with GPT and getting mind blowing results. this isn’t a guess or a hunch it’s a fact.