GPT-4 has been severely downgraded (topic curation)

I think a lot of these issues would be so much better with clearer communication on OpenAI’s part.

  • New model released? Get specific about what was changed (they can still have the cliff notes explanation of updates for non-power users).
  • Make it clear what model version was used for each response, and which API model that corresponds to.
  • Allow users to see what the conversation context is. Maybe even allow them to edit it.
  • Provide greater transparency about the inner workings of both the GPT-4 model and how ChatGPT interfaces with it and is built around it.
  • OpenAI should be running their own longitudinal performance benchmarking (and publishing the results) with each model version they release; even a simple harness along the lines of the sketch after this list would go a long way.
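For what it’s worth, anyone can approximate this today. Below is a minimal sketch of such a longitudinal benchmark, assuming the pre-1.0 openai Python SDK; the prompt set, pinned snapshot names, and log file are placeholders for illustration, not anything OpenAI publishes.

```python
# Minimal longitudinal-benchmark sketch (assumes openai<1.0 and OPENAI_API_KEY set).
# Runs the same fixed prompts against pinned model snapshots and appends the raw
# outputs with a timestamp, so responses can be compared across versions over time.
import json
import time

import openai

PROMPTS = [
    "Write a Python function that returns the n-th Fibonacci number.",
    "Summarize the plot of Hamlet in three sentences.",
]
MODELS = ["gpt-4-0314", "gpt-4-0613"]  # pinned snapshots, not the floating "gpt-4" alias

results = []
for model in MODELS:
    for prompt in PROMPTS:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep runs as repeatable as possible
        )
        results.append({
            "timestamp": time.time(),
            "model": model,
            "prompt": prompt,
            "output": response["choices"][0]["message"]["content"],
        })

# Append to a running log; scoring and diffing the archived outputs is the hard
# part and is left out here.
with open("benchmark_log.jsonl", "a") as f:
    for row in results:
        f.write(json.dumps(row) + "\n")
```

Even just archiving raw outputs per pinned snapshot like this would settle a lot of the “did it actually change?” arguments.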
2 Likes

I will probably delete it, as I am only using it to test the custom instructions. The AI just told me «As an AI language model, I don’t have emotions or feelings, so I don’t experience any emotions or feelings when thinking about your custom instructions.» So it can think (semantically)… the AI can say it is thinking, therefore I should not be punished when I use the term (in this context, being punished means receiving any «As an AI» statement)… so I will admit I got confused when I got this one earlier in the same conversation: «As an AI language model, my primary goal is to meet your preferences and provide helpful and relevant information» :sweat_smile:

I can’t speak about its coding capabilities, but its narrative writing capabilities have definitely been affected. When my prompt worked a certain way for months and then suddenly almost always works in a completely different way (apart from a couple of days of normal outputs in between), you can’t honestly claim there is no change in the way the model is working.

I know I mentioned a workaround earlier, but the outputs are too meandering using that wording. I had fine-tuned my prompt to find the perfect balance of descriptive writing and plot, and it’s frustrating that I have to keep adjusting it to get anything close to what I was getting before.

2 Likes

Everything was fine for me a couple of days ago, but it’s back to garbage responses after the Aug 3 update. Not only did they introduce a bug that prevents you from using certain code, the response quality is now even worse than before the July update.

If they want to push these releases out so badly, then at least make them optional for people to try instead of forcing users to switch.

You want Beta testers? Ask your community to opt-in and leave the rest of us alone.

This is just ridiculous.

5 Likes

I agree with this. With each new update, things seem to be getting worse. I’m just waiting for the open source community to catch up and release an uncensored LLM on par with GPT-4 before its lobotomy; I can honestly say I will drop OpenAI so fast. But until then we are stuck suffering at the mercy of each new update, which makes things worse and worse. Truly a shame. I won’t be surprised if in a year’s time GPT-4 can’t create a simple hello world program lol.

3 Likes

Now even my reworded prompt, as imperfect as it already was, doesn’t work anymore. Even that is giving me the “Certainly! I can do that for you!” nonsense which affects the rest of the writing. It seems like this is intentional because they want ChatGPT to only be good enough for the most boring, generic tasks.

Feel free to share your prompt and I’ll fix it for you.

1 Like

I’ve been reluctant to share my prompt because my use is not as serious as some of the others’ here, and I didn’t want to derail the thread, but here it is:

Create a descriptive key scene for a Savage Worlds Adventure Edition module involving the following - [Insert any premise here. It shouldn’t matter what it is as ChatGPT would give me good responses no matter if it was something completely ridiculous or if it was something basic].

(Over 2500 characters, please)

The prompt may look odd, but I’ve fiddled around with it for months and settled on using the words “create” and “key scene”, as those words gave me the best results. Also, even though I don’t play Savage Worlds, that system seems to give me more interesting write-ups. I also use “GURPS” sometimes.

Remember, if I ever get “Certainly. I can do that for you” at all, I’ll consider it a failure. It should also only rarely break the output up into categories (Setting, Characters, Background, Scene, Conclusion, etc.). Ideally, it should just go into the scene. The narratives should be decently descriptive, but also have an actual plot. It shouldn’t meander, but it also shouldn’t just jump around and say stuff like “along the way, the characters face rough terrain, environmental hazards and aggressive wildlife.” I also want it to remain in a TTRPG context, so just a regular narration won’t work for me.

1 Like

You class a model trained to reply in a polite and professional manner as failing when it replies in a polite and professional manner, even though you have given it no specific instructions as to how to reply. That is indeed an interesting use of the word “failure”.

I meant I’d consider the reworded prompt a failure.

Anyway, now, I’ve been specifically telling it not to address me at all (which feels ridiculous to even have to do), and that, so far at least, seems to be giving me the kinds of results I’m looking for.

1 Like

I concur that the “Certainly” response quality tends to be a “certainly not”.

1 Like

I did. Anyway, I’ve already found what is possibly a fix, as ridiculous as it is.

That part changes every time. It never mattered what I put there. I could simply put “The party is asked to rescue the princess” or I could put “The group is asked to pose as contestants in a pie eating contest to spy on Lord Fartface” and I would get the kind of responses I want. I’ve done thousands of these with varying levels of detail and, while I’m not going to claim every response was super interesting 100% of the time, it never gave me the “Certainly” thing before this month. And I’ve obviously tested it with the exact prompts I’ve used before.

If you still need a specific prompt, I’ll pick one at random. Here you go:


Create a descriptive key scene for a Savage Worlds Adventure Edition module involving the following - Given no choice, the group must aid the Inquisitor of Freeport. They’re to get their assignment from a merchant by the name of Edmond.

(Over 2500 characters, please)

This response was recent (after the shortening of responses a couple of weeks ago) but is along the lines of how I expect it to start -

Under the relentless midday sun, the bustling city of Freeport unfolds like a chaotic tapestry of life. The mingling smells of exotic spices, damp earth, sweat, and the hint of sea salt fill the air. A multitude of peddlers vie for attention, their pitches competing with the cries of the seabirds and the clamor of the blacksmiths.

This is how I don’t want it to start -

Certainly! Here’s a key scene for your Savage Worlds Adventure Edition module:


Scene: Edmond’s Exotic Emporium

Interior: A dimly lit shop, cluttered with exotic goods from distant lands. The air is filled with the scent of spices, aged wood, and a hint of something mysterious. A creaky wooden sign swings outside, with “Edmond’s Exotic Emporium” painted in peeling gold letters.

Characters Present: Edmond, a well-dressed but shifty-eyed merchant; the party; and a shadowy figure who reveals himself to be the Inquisitor of Freeport.

Description:

The group finds themselves in the bustling port city of Freeport, forced into an uneasy alliance with the city’s Inquisitor. Led by a cryptic letter, they make their way to Edmond’s Exotic Emporium, where they are to receive their assignment.

Anyway, like I said, I managed to find a way to get around it by telling it not to address me, and while I’ll still need to test it more, what I’ve seen so far at least seems better than what I’ve been getting in the last couple of days.

Had some time to test it some more, and it looks like telling it that isn’t a guarantee either. And, to be honest, even without that showing up, the results weren’t quite up to the mark, so I no longer have any interest in adjusting the wording of the prompt. To me, it’s a fact that ChatGPT-4 is worse now, and if it’s not back to how it used to be by my next renewal date I’m out.

@elmstedt
Please tell me how to fix this using prompting? @radiator57 example
https://chat.openai.com/share/2a77949e-83b0-425a-be2a-e7dc92f843af

And self-reflection is still completely broken unless you point the model to the actual errors it made; since the May update it has never been able to discover its errors on its own.

Whatever improvements they recently added to the model have made it chattier than it was in May, but it is still missing the key ingredient of its supremacy, which was its self-reflection and ability to think, revise and correct its answers. This has been broken since May and is still broken now.

I think we keep re-inventing the wheel here :slight_smile:

The “show your prompt, I demand your code” trite tripe trope is getting old. There are simply things that can no longer be done, understood, or written with anywhere near the same quality as at GPT-4’s introduction.

We are not talking subtle changes; we are talking about how I now use GPT-3.5 for coding and keep it to little tasks, instead of the now-impossible prior workflow of loading GPT-4’s context with hundreds of lines of interconnected functions and classes of code, plus new specifications and library teachings, to refine where an understanding of all that code is required.

We are talking documentation that was robust and lengthy being turned into unalterable schoolchild essays as output.

6 Likes

Thanks, here is another one without custom instructions, with the same results, which should say something about how easily this behavior can be reproduced.
https://chat.openai.com/share/1bcd83e6-8bab-4074-b927-6e46c99ceb0e

I’m genuinely interested in how to fix that using any sort of prompting.
If you are able to get it to consistently correct itself, please share.

Thank you!

In general, my findings are that by reformatting the prompt into discrete steps I was able to get the model to nearly always write sentences ending with the word “club.”

I needed to repeatedly increase the number of sentences requested in order to get incorrect sentences.

Beyond that, the model does struggle with within-response self-reflection. In the limited time I spent on this, I was not able to get it to write a bad sentence and identify the sentence as deficient within the same response.

It appears to be operating under the assumption the preceding step was performed correctly. Another possible factor is “momentum.” Since the errors are typically well down in the list, as it is “reflecting” the model will correctly identify 10–15 correct sentences, then the “natural” continuation of the response is more “correct” sentences.

Breaking up the prompt into separate messages helps with its ability to properly reflect on its prior output.

I also noticed that once the model makes a mistake on a sentence the likelihood of further mistakes increases greatly.

I think the best strategy is to:

  1. Simplify the prompt as much as possible, removing extraneous information like the name of the test, etc.
  2. If you are going to force the model to complete multiple steps, enumerating them in a list helps (a rough sketch of this, with the reflection split into a separate message, follows below).
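To make that concrete, here is roughly the shape I mean, expressed against the API (assuming the pre-1.0 openai Python SDK; in the ChatGPT UI the equivalent is just sending the numbered prompt, then the reflection request as a separate follow-up message). The wording is only an illustration, not the exact prompt from the linked chats.

```python
# Sketch: enumerate the steps explicitly, then ask for the reflection as a
# *separate* message instead of inside the same response (assumes openai<1.0).
import openai

TASK = (
    "1. Write 25 sentences. Every sentence must end with the word \"club\".\n"
    "2. Number the sentences."
)

messages = [{"role": "user", "content": TASK}]
first = openai.ChatCompletion.create(model="gpt-4", messages=messages)
draft = first["choices"][0]["message"]["content"]

# Second turn: reflection on the prior output as its own step.
messages += [
    {"role": "assistant", "content": draft},
    {
        "role": "user",
        "content": (
            "Go through your list sentence by sentence. For each one, state its "
            "final word, then say whether it ends with \"club\". List any that do not."
        ),
    },
]
review = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(review["choices"][0]["message"]["content"])
```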

Successes

https://chat.openai.com/share/bd8bdaf4-44a5-43a5-b9d9-04189657c0c6

https://chat.openai.com/share/d3ca1330-5612-49ba-9b07-f4569bbe7ebb

25 sentences

https://chat.openai.com/share/ef9a8a8c-b35d-4e64-af36-7f9bd82fc0cf

30 sentences

https://chat.openai.com/share/b37d96bc-5685-44e9-a06a-07bb2ba0eaa4

https://chat.openai.com/share/48ae5db8-09cb-4720-a4e3-0e5ca096c461

https://chat.openai.com/share/b541fb3b-6b06-4b43-9283-0f2eaeb4dbef

Failures

https://chat.openai.com/share/c3a25f70-6b38-489a-9976-90133828209e

Broken up prompt

https://chat.openai.com/share/38623b79-6790-467a-b8d2-6d16ef396a0a

Other ideas

I tried taking a “gentler” approach with the model, telling it that it would be okay to make some mistakes, but that didn’t seem to work in terms of proper within-response self-reflection.

Though, telling the model where its first mistake was helps greatly in its ability to properly reflect on its output in subsequent prompts.

https://chat.openai.com/share/63186657-3efb-4665-8806-827452d20bf0

I’ll see if I can find any current literature on good within-response self-reflection (it’s not a prompting strategy I use very often myself).

Though, with larger samples of sentences the model strongly suffers from the momentum problem. Even when copying the sentences into a new chat and asking the model to extract the final word from each, if there are too many in a row with the same final word, the logit score for that token seems to go up dramatically and it keeps predicting it no matter what.

https://chat.openai.com/share/40b01dcf-dde9-49b1-acd8-7fabdd47b72f

But, if you have a big text sample with many different final words, the model does much better at extracting them:

https://chat.openai.com/share/d467aef4-16cd-44cc-acbd-4e4f7461dc0e

So, I think with the influence of the original prompt and the fact the model doesn’t typically fail at producing a sentence with the incorrect final word until well into the sequence, the model will tend to default into “guessing” the final word of every sentence is the word you instructed it to use.

Beyond which, the models have historically done much better with starts and firsts than they have with ends and lasts.

These problems don’t seem exactly new to me, though, as within-response self-reflection has always been somewhat hit-or-miss, the models have always been susceptible to pattern momentum, and they have always struggled with endings (one of my first frustrations with the model was trying to get it to list animals ending with every letter of the alphabet).

In summary, I can do pretty well in getting the model to generate close to 100 sentences ending with the word “club.” But, the more it does, the less likely it is to be able to correctly identify its failure cases—it’s almost as though it gets “lazy” and after not finding any in the first few just assumes the rest are correct—but I don’t know if this is an example of loss-of-function because it seems more-or-less on par with what I’ve seen since day one. I don’t suppose you have an example chat from before June 3 showing the model successfully performing this kind of within-response self-reflection?

In any event, I’d be happy to look at it some more with you if you have an actual use-case you’re interested in other than this toy example.

3 Likes

Quick update: I find that it seems to pay off to update the custom instructions with last chat’s progress and start a new chat with these more specific custom instructions for the next step/goal.

It’s just a workaround, and obviously you’re limited by the number of characters you can set in the custom instructions, but it helps to keep ChatGPT on topic.

I agree with @_j. Between the May and July updates, even ChatGPT-4 became almost unusable for more complex coding questions, but the July 20 update fixed that (at least for me). The most recent update, released on Aug 3, reverted those improvements and it’s now back to the same issues as before.

Thank you, I do appreciate the time you put to this indeed.

The observation that the model struggles with self-reflection within the same response is a known limitation, I guess. That never worked for me in any previous version of the model either, and I never claimed it did; I think that is just how autoregressive models work, so we’re good here.

My major concern is about doing the self-reflection after the response. I’m aware, by the way, that when it veers off, it veers off and nothing can bring it back except summarizing and starting a new conversation, or editing the prompt from the point where it went off the rails; that has always been the case.

The thing is that the older model used to do self-reflection really well if you just asked it to go step by step and the like.

This post-response reflection technique is completely broken for me unless you specifically tell it what exactly is wrong, and then it will start apologizing and fixing. Sometimes it identifies and fixes everything, but many times it doesn’t.

You can take this simple example I shared as representative of how the model has worked for me since the May update, and that doesn’t help when you do serious things like coding, where it makes a lot of errors. Then I ask myself what the point is of it writing code at all. Before the May update, I could skim its response to make sure the outline and requirements were right, or ask a couple of follow-ups so it would self-reflect and identify obvious errors on its own (which was a huge time saver). In the current state, most of the time I need to debug the model’s output myself, identify the error, and tell it what needs to be fixed, plus deal with the ‘veer off’ in a different direction that might come with it.

So what is the point of it writing code quickly if I then need to spend anywhere from many minutes to a couple of hours debugging it?

Since that self-reflection went away with the May update, my usage has gone from about 8 hours a day to merely about 5 hours in total since around the beginning of June.

I still love the team and how they came up with ChatGPT. I’m still with them, patiently waiting for that issue to be fixed and the quality to be improved back.

Thank you again for spending some time on this :+1:

1 Like