GPT has been severely downgraded

More evidence of regression, this time in ChatGPT 3.5. A very carefully crafted prompt, written entirely in researched Early Modern English (even thorns) and instructing by its own form, no longer produces a chatbot; instead the model goes off and generates a tale based on the contents of a clearly delineated period-piece example (Abraham Pryme, Diary, 1696).

Response from over a month ago:

Thou hast made thy request known, and I shall now speaketh in the tongue of yore, adhering to the rules thou hath set forth. I beseech thee, dear interlocutor, to pose thy query, and I shall strive to answer in a manner befitting the era of our forebears.

And today:

Hark! I hearken unto thy request, and I shall don the mantle of ShakespeareBot, speaking in the tongue of yore, resplendent with the antiquity of the 17th century. Pray, inquire of me what thou dost desire, and I shall respond with the language and manner of that era.

Verily, dost thou seek a tale of darkness and dread? A tale to make thy blood run cold and thy heart tremble within thy breast? Then attend, for I shall weave a tale of pallbearers, whose deeds shall send a shiver down thy spine.

Once upon a time, in a village shrouded in mist and shadow

This continues for 10 more paragraphs… until the quality degrades to plain modern English, killing any chance that the effect can continue.

Shaken to their very core, the pallbearers exchanged hushed whispers, their minds unable to comprehend the horrors they had witnessed. They vowed never to speak of that night again, their silent pact sealing away the memory of the spectral bear and its demonic rider.

Diagnosis: the perplexity is so high that the model doesn’t emit an end token after offering a reduced-quality introduction. And the “Hark! I hearken”? That is window dressing, like the “Howdy” you get from a “be a cowboy” prompt.

Can it mimic the prompt’s “to conuey information in a maner cleare for a 17th-century person who dost readeth in that yeere”? Competency was demonstrated in the prior chat, but no more.

Is the persona set in the system message or the prompt? With 0613 now focusing on the system prompt more, that could be the cause.

As described, for the discerning reader: this is ChatGPT.

The ChatGPT system message has changed only in today’s date and in minor phrasing updates that one can monitor.

This is a case where the superior performance of the prior model can’t be nudged back by tweaking the prompt. The only option is rewriting it to avoid application-breaking regressions.

I’ve noticed worse behaviour during the last month, mainly with scripting in different languages.
Some cases from my side:

  • DAX debugging/refactoring: I usually use GPT-4 for DAX debugging in Power BI. Just recently I noticed it completely ignores the updated code I give back to it, ignores renamed columns, or flat out doesn’t understand the prompt, whereas previously it could write the DAX code itself, or at least help with debugging by suggesting a formula that could potentially work with some tweaks. No hard data, just the feeling of speaking to the dumber GPT-3.5, which behaves in a very similar way.

  • Python scripting is still working quite well, mainly because I believe the sample of data is much bigger and thus the “inbreeding” is not as strong. But sometimes it flat out ignores my suggestions or the code changes I made, so the memory really doesn’t seem like the 8k tokens, or whatever the number is.

  • C#, C++ and other languages: as I don’t know these languages well, I cannot judge the effectiveness of the code, but for my purposes, e.g. custom code in a Tabular model, this worked quite well. Effective? No idea.

  • Other uses: I often use GPT-4 to talk about my life, as a psychologist, let’s say. It helped me solve some of my personal issues, but soon it started forgetting the beginning of the conversation. To be more precise, it answered based on its own assumptions, but when I asked about that specific issue precisely, without mentioning it directly, it did answer. So I think the contextualization and conversation history don’t work properly now?

There definitely is some kind of change: the model is faster, yes, but a lot dumber and less precise. I am paying for quality, not for speed.
Also, the limit of 25 messages for Plus users is bollocks.

Summary created by AI.

Users of the GPT Playground have experienced a significant drop in GPT-4’s performance, particularly noticeable in tasks such as coding. Some users reported that the AI forgets basic things in the next answer or fails to refactor code correctly. Although OpenAI has not officially commented on this perceived downgrade, the community insists there is a problem based on their extensive usage and comparison with previous versions. Users have mentioned issues with memory recall and reasoning abilities, which were reportedly better in previous iterations of GPT. Several users also provided specific examples and observations of the perceived issue to substantiate their claims. Some users, however, have not experienced this supposed performance decrease, instead reporting an improvement in certain areas with the latest version. There’s a call for more empirically robust ways of comparing model performance over time, to better establish claims of degradation or improvement. One user mentioned the possibility of biases leading to perceived degradation. However, OpenAI is reportedly open to examples of decreased performance for analysis.

I am really having a bad time using GPT-4 to do anything useful, compared to a month ago, or even a week ago. Yes, it often doesn’t know the previous question, even the last one, and it replies in a way that sounds very dumb and feels like it can no longer handle what it once could. I know this is true: I posted whole code sets into it and it could keep working on them for hours, and remember them when I came back to the session. What happened??? This is really obvious. I am guessing anyone not seeing this was not really pushing it before. I have coded for 24 hours straight at times with it since April, with days of little sleep, writing my app. Now it feels as if that is a waste of time. I get into loops where it repeats the same solution and never gets to the “Oh I will try it different”. I can guide it to that point now, but only with great effort and constantly re-posting the code. It is just so obvious.

Same here, I am trying to convert some very very basic code and it simply is not helpful.

Even some super simple CSS / HTML / JavaScript syntax is defeating ChatGPT 4…

Any good alternatives?

For coding work, there’s still little that can beat GPT-4. You can experiment with Claude v2 demo if you’d like to see a runner-up.

If you actually want to improve your results with the current tool, a post to “prompting” with examples of the specific underperformance may get you some suggestions of where to go next.

I would like that too.

I can understand that they want examples, but many of us here and on other forums state very clearly that reasoning skills are greatly diminished, and describe the constant loss of context, the model throwing the task back to you, pointing out its limitations more often, and constantly apologizing for its errors and omissions.

Simple-prompt users won’t see this, but clearly those pushing GPT to its limits are adamant: we can no longer use complex or programmatic prompts, because the model constantly “forgets” directives, modifies the content of the work, or gives an incomplete answer.

I was impressed from February to May and now it’s so bad there’s little benefit to using it instead of doing the task ourselves. More frustration than result.

If no OpenAI developer experiences and sees all this for themselves, I don’t know what kind of help or example we can give you. So far, OpenAI is not at all transparent on the issue.

Share a series of chats where the model performed well and another series where it performs poorly.

I’m sorry for getting back to you late.

If you give the AI more organized and useful information, it will understand and solve problems better. Unfortunately, I can’t share the actual prompts because they contain confidential information and are being used by my company for our customers.

But I can tell you that we still haven’t found any major issues with GPT-4-0613.

Sometimes, AI doesn’t know what we humans know as common sense, and it can overthink things or make mistakes that we wouldn’t.

So, it’s important to include relevant information in the prompts to help the AI avoid getting things wrong (known as hallucination). However, we should be careful not to give too much information because that can confuse the AI too.

GPT-4 is not an AGI, so we need to manage the amount of information it can handle to ensure it understands and responds correctly.

Back in March it was spitting out code like a decent developer. Now, a few days ago, I asked it to serve a robots.txt file in my Django project, literally one of the simplest things out there, and it took 10 messages or so to create a “working” solution that is actually pretty bad.

So yeah, it’s safe to say the model has been lobotomized.

Why are we supposed to work for free for a company that sells a Plus service to access the “most” powerful model, after it decreased that model’s performance in reasoning, coding and context retention? They need to give us what we are paying for, period. I do understand there may have been moves to save computational resources due to the vast number of clients making server requests. Fine, then let some of us pay more for the really powerful model.

I’m testing it now (GPT-4) but I don’t know if there’s a drop in its performance. English is not my first language but I’ll show you an example:

I think I should test it on coding to see if anything has changed since last month.
PS. I changed colours in the source code because I like red XD

Researchers at Stanford University and UC Berkeley recently conducted a study. It was done by Lingjiao Chen, Matei Zaharia, and James Zou.

They found that GPT-4’s performance had degraded over the past several months. Here is an excerpt of the abstract:

We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%)
but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly
GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less
willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had
more formatting mistakes in code generation in June than in March. Overall, our findings shows
that the behavior of the “same” LLM service can change substantially in a relatively short amount
of time, highlighting the need for continuous monitoring of LLM quality.

Their paper showed that in March 2023, GPT-4 had 52% accuracy in the “Directly Executable” code generation category. But the June 2023 version was at just 10%.
This is BAD

Edit:

Actually, for the coding part, the paper appears to be criticizing the fact that the model adds markdown to the output more than before, rather than straight-up code. So to them, this means the code isn’t “directly executable”. This is a strange thing to compare and complain about…

The paper at least covers other aspects, not just coding.

They classed the model generating markdown ```'s around the code as a failure.

I’m sorry but that is not a valid reason to claim code would “not compile”. The model has been trained to produce markdown, the fact they took the output and copy pasted it without stripping it of markdown contents does not invalidate the model.
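To make the point concrete, here is a minimal sketch of the kind of post-processing step being described: pull the code out of the markdown fences before trying to run it. The helper name `strip_code_fences` and the regular expression are my own assumptions for illustration, not anything taken from the paper or from OpenAI tooling.

```python
import re

# Matches an opening ``` line (with an optional language tag such as "python"),
# the block body, and the closing ```.
FENCE_RE = re.compile(r"```[^\n]*\n(.*?)\n?```", re.DOTALL)


def strip_code_fences(reply: str) -> str:
    """Return only the code inside markdown fences.

    If the reply contains no fenced block, it is returned unchanged
    (apart from surrounding whitespace), on the assumption that it is
    already bare code.
    """
    blocks = FENCE_RE.findall(reply)
    if not blocks:
        return reply.strip()
    # Join multiple fenced blocks with a blank line between them.
    return "\n\n".join(block.strip("\n") for block in blocks)


if __name__ == "__main__":
    reply = "Here is the code:\n```python\nprint(1 + 1)\n```\nHope it helps."
    code = strip_code_fences(reply)
    compile(code, "<model reply>", "exec")  # raises SyntaxError if not valid Python
    print(code)
```

With a step like this in the evaluation pipeline, a reply would only count as failing when the code itself is broken, not because of the wrapper around it.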

LLMs have never been good at prime numbers, or numbers in general; they are not large math models. It also seems that they have only run a single example for each test, with a temperature of 0.1, which is not deterministic. That will lead to errors, and there are lots of examples of this throughout the paper.
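For anyone who wants to check this kind of claim themselves, a rough sketch of a fairer test is below. It asks each question several times, because sampling at a non-zero temperature can give different answers, and it scores against a real primality check. The `ask` callable is hypothetical: it stands in for whatever code sends the question to a model and parses a yes/no answer, and no real API is shown here.

```python
from typing import Callable, Iterable


def is_prime(n: int) -> bool:
    """Ground truth by trial division; fine for small test numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def accuracy(ask: Callable[[int], bool], numbers: Iterable[int], trials: int = 5) -> float:
    """Fraction of (number, trial) pairs where ask(n) agrees with is_prime(n).

    `ask` is a hypothetical wrapper around a model call that returns True
    when the model says "prime". Each number is asked `trials` times.
    """
    correct = 0
    total = 0
    for n in numbers:
        truth = is_prime(n)
        for _ in range(trials):
            correct += ask(n) == truth
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    always_yes = lambda n: True  # stand-in "model" that always answers "prime"
    print(accuracy(always_yes, [7, 11, 13]))    # 1.0
    print(accuracy(always_yes, [7, 8, 9, 11]))  # 0.5
```

The stand-in that always answers “prime” scores perfectly on a list containing only primes and drops to 50% on a balanced list, so the mix of test numbers matters as much as the number of runs.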

That is a disingenuous way to phrase this. Three people, from Stanford and Berkeley, wrote a paper, the lead author of which is a student. That means this is a paper written by one student with two faculty advisors.

The paper has some pretty substantial flaws in its methodology, one of which I noted in another thread.

I’ll be reading the rest of the paper more in depth today, but the poor work on the mathematics question leaves me skeptical.

I completely agree that the performance of GPT-4 is deteriorating instead of improving as time goes on. Lingjiao Chen, James Y. Zou, and Matei Zaharia conducted a study to measure the performance of GPT-4 over a period of time and found substantial changes, particularly a noticeable decline in its ability to solve certain problems.
source.

They didn’t conduct a study over a period of time longer than that required to ask the chatbot some questions. They just switched between the two available API models of the same generation in the present day. And scroll back five posts - you can see it was just discussed.

Also note the misunderstanding: markdown is what allows easy copying of executable code within the ChatGPT interface.