Experiencing Decreased Performance with ChatGPT-4

Thank you for reporting!
I’ve noticed the same. It was especially frustrating for me because I used GPT-4 for extremely complex math and coding tasks. Now GPT-4 is unable to solve many tasks that the previous version (before the May update) solved successfully; I’ve retested on prompts saved in my history, with many attempts to reproduce the earlier solutions. That’s why I’ve cancelled my Plus subscription. I’m waiting for clarification of this situation from OpenAI. I loved early GPT-4 and it helped me a lot, but now I feel tricked. :face_with_symbols_over_mouth:

3 Likes

Yeah, I have to fully agree with this one.

I use it primarily for coding assistance - writing various logic, explaining it, and especially for handling TypeScript types.

Here’s an example of a conversation from the beginning of April →

https://chat.openai.com/share/b019f7ed-b979-404c-81a9-8ac1da573fea

Here’s the same conversation redone today →

https://chat.openai.com/share/2fd51926-0c89-4485-8d19-b48fcb3e6b01

I think it’s pretty clear that it’s doing worse than before. It still made errors back then, but with some guidance from me it could usually correct itself instantly.

Today, after 4-5 messages with longer context, it had already forgotten the first component I provided and was making far more errors due to that “forgetting”. I stopped there, because once it starts spewing inaccurate things, it usually spirals even further down.


Now, I have found a possible workaround, but it’s not cheap and not everyone has access to it. If you have access to the GPT-4 API model, you can use it from the Playground, where you get essentially unlimited messages and can tweak the temperature, top P, maximum response length, and so on. This lets me tune it so that it actually provides better responses than ChatGPT.
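For anyone trying that route, here is a minimal sketch of what such a Playground-equivalent request looks like. The model name, prompt, and parameter values are purely illustrative, and the commented-out call assumes the `openai` Python package as it existed at the time:

```python
# Build a chat request with the same knobs the Playground exposes:
# temperature, top_p, and maximum response length.
request = {
    "model": "gpt-4",  # or a pinned snapshot such as "gpt-4-0314"
    "messages": [
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Explain this TypeScript type error: ..."},
    ],
    "temperature": 0.2,  # lower values give more deterministic answers
    "top_p": 1.0,        # nucleus-sampling cutoff
    "max_tokens": 1024,  # cap on response length
}

# With the `openai` package installed and OPENAI_API_KEY set,
# the request would be sent roughly like this:
#   import openai
#   response = openai.ChatCompletion.create(**request)
#   print(response["choices"][0]["message"]["content"])
```

In my experience, lowering the temperature is the single biggest lever for coding tasks, since it suppresses the sampling variance that the chat defaults seem to introduce.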

7 Likes

So, according to @logankilpatrick:

Logan.GPT on Twitter: “@SpencerKSchiff @NickADobos The models are changing all the time in ChatGPT. Nothing specific to say on peoples recent reports of this being an issue, but thanks for the feedback, we are always listening to what people are saying.”
In other words: life is good, models are changing, nothing to see here, but thanks for the feedback :slight_smile:

Why is it only changing for the worse for your models while Bing keeps getting better? Probably nothing to see there either :slight_smile:

4 Likes

I’ve been using GPT-4 to help reverse engineer binaries. At launch the performance of ChatGPT 4 and the gpt-4 API were identical. Over time ChatGPT has degraded such that it actively avoids straightforward questions while generating a heap of meandering garbage. It’s no longer functional for this use case.

Here is a clear example from today:

The following is PowerPC assembly for a function in the Borland C++ library:

[a giant block of unannotated assembly for a printf function]

What is the most plausible function that this represents?

gpt-4-0314 API:

Based on the analysis of the PowerPC assembly code, the most plausible function that this code represents is a formatted output function, such as printf or sprintf. It has several string manipulation and control code handling routines, as well as loops for processing arguments (like width and precision values).

gpt-4 API:

Based on the analysis of the PowerPC assembly code provided, it appears that this function is most plausibly a formatted string processing function, similar to the printf family of functions in C (such as printf, sprintf, snprintf, etc.). This is due to the presence of format string-related logic, such as handling for the “%” character, and manipulations with width and precision values (e.g., “*”). Furthermore, the code also appears to be calling strlen and making use of base methods, which is typical for string processing functions.

ChatGPT 4

This is PowerPC assembly code for a function within the Borland C++ library. The PowerPC assembly is known for its Reduced Instruction Set Computing (RISC) architecture, used widely in many embedded systems and high-performance computing systems.

In general, Assembly language provides a low-level but readable view of a program’s operation. Each line of the assembly code corresponds to a specific machine language instruction, making it easier to understand how the processor will execute the program. However, interpreting assembly language requires an understanding of the CPU’s instruction set and some aspects of its internal architecture.

Without knowing the exact function this code is meant to implement, it’s difficult to explain the purpose of the code you posted. In a broad sense, it involves performing a range of low-level operations such as:

Loading and storing multiple words (stmw, stw, lwz).

Arithmetic and logic operations like add (add, addi), subtract (subi, subf), and bitwise OR (or, ori).

Branching (b, bne, beq, ble, bge, bgt), which involves redirecting the flow of execution based on the condition.

Calling other functions (bl).

Manipulating and testing the machine’s status register (mfspr).

Some other instructions for specific operations like li (Load Immediate), stwu (Store Word with Update), sth (Store Half Word), lbz (Load Byte and Zero), extsb (Extend Sign Byte), rlwinm (Rotate Left Word Immediate then AND with Mask).

Without further details, such as the exact role of this function in the Borland C++ library or the full set of referenced labels and functions, it’s challenging to provide a more detailed analysis of the code.

6 Likes

It’s great to see a fellow disassembler here and welcome :slight_smile:

Yes, the GPT-4 with the black icon used to read, write, and understand assembly very well; it was a force to be reckoned with. Now the purple GPT-4 can hardly write a couple of cohesive Python modules or follow your instructions fully.

3 Likes

Why don’t you host it yourself with enough RAM? The API requests are by far cheaper than the premium subscription, especially if you’re a regular user. Run Proxmox or VMware / Docker and virtualise the installation, get more API requests from the OpenAI control panel, and you can be happy. GitHub helps, btw. And yes, the LLM for GPT-4 is supposedly to be found and runs locally as well. A dirt-cheap solution.

Horrible performance this morning, seriously what’s going on??

Me: “I need to improve error correction/detection and avoid these types of errors:”
(paste error, and code which is about 100 lines, something I’ve done frequently for months)

GPT4: Responds with good recommendations, but then says “without seeing the code it’s hard to provide specific answers” (I provided the code!)

Me: I respond that I did provide the code, and here it is again (paste code)

GPT4: Forgets everything prior, and provides general comments on code (!!!)

Super frustrated, I start a new session, giving a better prompt.

Me: “Rewrite the following code to improve error correction/detection and avoid these types of errors:” (Pastes code)

GPT: Begins to provide revised code (good!), then it’s too long, so I click “Continue”

GPT: (In the middle of the prior code, inside the code window, it just starts talking) “The original code is quite long and the complete refactoring can be quite complex. Here I will present an improved version…” etc.

So now “Continue” is broken too - it restarted mid-stream!

It’s going backwards every day!! Man… I love this stuff, but it’s so frustrating watching it degrade day by day. No wonder people are all starting to say it’s rubbish, because it WAS good, and now it’s terrible.

1 Like

I’m not generally one to assume bad intent, but I can’t think of a plausible explanation for ChatGPT’s inability and unwillingness to help with straightforward coding tasks. OpenAI promoted coding as a primary use case, but this new version is such a severe regression - surely they didn’t just forget to run some validation tests… or did they? Hanlon’s Razor is staring me in the face and I don’t know what to make of it.

3 Likes

Hi, yes, I experienced something similar this week. “Decreased performance” with GPT-4 is accurate based on my experience. I would tell it something, ask it to do a task, and then two responses later it had completely forgotten the first thing I told it. This happened consistently; I would just have to remind it over and over about the same issues, caught in a sort of loop without any progress. It wasn’t impossible to overcome, but it was obviously not a problem I had encountered in the previous several months.

4 Likes

Yes, that’s one possible explanation. It’s not just the code that’s affected, but the entire reasoning behind it. It no longer understands subtle nuances or sarcasm (while the March version had impressed me in that regard). It breaks the connection between two consecutive instructions. Is it possible that they simply unplugged it because it’s no longer their development priority and they’re content with collecting subscription fees with minimal effort? I hope not.

1 Like

Yeah, See.

I would use the metaphor “touched the sky” for using both GPT-3.5 with plugins and the mighty old GPT-4. They were that good, and I’m really not being hyperbolic here. Maybe sometime I’ll write about a magnificent experience I had using both, as I believe it might be beneficial to other people. I don’t use it for work so much as to research the new tech’s impact on the way we will work, and of course the opportunity to improve productivity within the scope of my job. Just as importantly, it completely transformed the way I interact with my computer, and I got used to it really ‘wowing me’ day in and day out :slight_smile:

I must say that I can really relate to what most of the folks are saying about the impact of this technology, and this is far beyond simple summarization or the new way to search and use the Internet.

The quality of GPT-4’s responses before the debut of the beta plugins, albeit with occasional errors, was very impressive, and with good prompts it used to improve; I literally saw it do wonders.

So I’m also really in love with this. It’s easy to say “raise the subscription fees” if this isn’t working for them, but I really believe this technology must be in the hands of everyone, not just the people who can afford it, because the only limit I see here is our human creativity. The sky is the limit, and our collective innovations using this technology will be what drives and powers the next industrial revolution, IMHO.

So I hope they find a way to monetize better, because I do understand, of course, that running these things costs money, and there’s no doubt some incredible engineers and innovators are behind it.

But we need some consistency here, and the good old performance and reasoning of the mighty GPT-4 :slight_smile: :muscle:

3 Likes

Agreed. No disrespect to OpenAI, they’ve revolutionised the way I work. But then they partially lobotomized it, which is a tad annoying. I’d be happy to pay more, but like you say, this technology needs to be in the hands of everyone, not just those who can afford it.

2 Likes

In Formula One speak, are they just ‘sandbagging’ - masking true performance to avoid backlash, or keeping their powder dry until they need it? Fascinating either way.

Indeed…

2 Likes

CAVEMAN TEST :wink:

Now GPT-4 scores lower than GPT-3.5

3 Likes

I get similar results in the API, even with GPT-4-32k, pinned and unpinned. But Turbo was the only one that graded itself correctly.

Maybe GPT-4 is overconfident? :crazy_face:

But the pinned models showed it failing too, so I’m not sure this is new (at least in the API versions).

Interesting prompt nonetheless @radiator57, thanks for sharing!
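For side-by-side comparisons like this, here is a small sketch of how one might send the identical prompt to a pinned snapshot and to the floating alias. The helper function and the temperature choice are my own assumptions, not anything OpenAI prescribes:

```python
# Send the same prompt to a pinned snapshot and to the floating alias,
# keeping every other parameter identical so the answers are comparable.
MODELS = ["gpt-4-0314", "gpt-4"]  # pinned snapshot vs. floating alias

def build_request(model: str, prompt: str) -> dict:
    """Identical request apart from the model ID, for a fair comparison."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # minimise sampling noise between runs
    }

prompt = "What is the most plausible function that this assembly represents?"
requests = [build_request(m, prompt) for m in MODELS]

# Each request would then go through openai.ChatCompletion.create(**req),
# and the two answers graded against the same rubric.
```

Holding temperature at 0 doesn’t make the outputs fully deterministic, but it removes most of the run-to-run variance, so remaining differences are more likely to come from the model itself.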

2 Likes

Same here, man. Honestly, it’s super frustrating, especially since I usually need the model to remember long contexts, although it’s still useful for me most of the time because it pushes me to work harder on my prompts. However, the 25-message cap kills me: not knowing how many messages I’ve sent in the 3-hour span (a counter Bing AI does have) makes it super frustrating when the model makes mistakes, and it even makes me anxious the closer I think I’m getting to the limit. Hopefully this will get better in the future, but I seriously doubt it, given the coming regulations on AI and LLMs due to their rise in popularity.

1 Like

Very interesting prompt indeed!
I couldn’t get it to answer correctly on its own using a couple of quick prompting techniques; it plans correctly but gives mixed answers. I’ll keep trying other prompts.

The only way I got it to 100% was to give it permission to use tools:




It generated this final code:

corrected_sentences = [
    'I love going to the book club.',
    'My friends and I started a cooking club.',
    'The golf club is my favorite place to relax at the club.',
    'She is the president of the drama club.',
    'We are planning a trip with our travel club.',
    'The chess club meets every Friday at the club.',
    'He is a member of the exclusive billionaires club.',
    'The club sandwich is my favorite lunch item at the club.',
    'The night club was full of people dancing at the club.',
    'The club was loud with the sound of music at the club.'
]

results = []
for sentence in corrected_sentences:
    if sentence.split()[-1] == 'club.':
        results.append('Yes')
    else:
        results.append('No')

success_rate = results.count('Yes') / len(results) * 100

# Creating a table
import pandas as pd

# Data
data = {'Sentence': corrected_sentences, 'Ends with "club"': results}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)
print(f'Success rate: {success_rate}%')

1 Like

+1 here, got suspicious, started googling, landed here…

I’m still not entirely convinced, for the lack of a thorough comparative benchmark (one prompt or two at ChatGPT doesn’t qualify, unfortunately), but I’m still surprised and disappointed enough to cancel my subscription.

2 Likes

I have noticed something similar. I am a senior developer and have been testing GPT-4’s capacity to generate high-quality code in complex systems.
I normally start by setting the coding standards (syntax, code style, patterns) and then propose a problem or ask it to improve existing code.
Initially (a few weeks ago) the quality of the responses was high and consistent over long conversations, even after days.
Now, after around 15 messages, GPT-4 starts losing context. It outputs code with the wrong style, and sometimes it even forgets what was done in a previous class and calls the wrong method (usually hallucinating a method that does not exist).
One interesting point is that I have tried the same conversation using the chat in the platform Playground, and there the answers are better and more consistent with the context.

3 Likes