GPT has been severely downgraded

As a disclaimer, I have only just now gotten access to GPT-4 as it's being released to all developers.

As I see it, both sides are valid. The argument that those who work with the AI for a long time will know if quality drops rings true to me, and I've done some of my own anecdotal research with gpt-3.5-turbo, comparing the 0301 and 0613 versions.

At the same time, doing a more structured investigation into this phenomenon is ABSOLUTELY necessary. We are called scientists for a reason.

I also hear the argument that OpenAI should really be more transparent about what is going on. Of course communication is important, but in the meantime we can do our own work to figure out what’s going on.

Hi! Would you mind sharing the link again? I only get an error message.

Wait, I am confused. Isn’t he asking for the imgur link of the example prompt that _j sent?

This simply cannot be investigated by end users. It's a nondeterministic function that is continually being updated over time. We don't know what is changing or when, and even if people saved old prompts, which most don't, there's no guarantee you'll always get the same output.

Therein lies the problem. We know GPT-4 is far more efficient now than it used to be, and there are only so many ways to optimize inference on a trained model. All we can do is guess.

That's exactly what a hypothesis is, isn't it? An educated guess. Rather than complaining that it's a black box and a nondeterministic function, we could direct our efforts in the meantime toward doing what we CAN do, even though it is not an ideal situation.

Think about it. Yes, it would be ideal if the team shared exactly what changed and why, but imagine a world where people do develop tests to investigate the black box over the course of this discussion. That will be valuable research done for any future testing of other AI developments, and it means we will learn more about what is possible. Claiming that it cannot be investigated is a closed mindset.

Link updated, but Imgur is still doing bizarre things like an 18+ check if you aren't logged in: ChatGPT writing quality degrades - Album on Imgur


First, and maybe it’s just me, I don’t see the outputs as being of wildly different quality.

Second, though I'd need to run some testing here to verify, this looks to be easily attributable to intra-model variability.

It’s not unreasonable to expect if you were to generate 100 responses in April and 100 responses in July and rate the quality of each response, the distributions of quality would likely have significant overlap.

So, to me, this is not “smoking gun” evidence of a degradation of quality.

I've posted several times about what I would consider evidence for this claim, and while you can't go back to an earlier model, you can start now in order to test future models.

Pick a few prompts to use as your benchmarks. For each prompt, handcraft what you consider to be the ideal response and a rubric for scoring responses against your ideal. Run each prompt 50–100 times and score the results.

When the model gets updated again, repeat the experiment.

Then, you can do some very straightforward statistical analysis to determine if the model improved or regressed with respect to your specific benchmarks.
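
To make that concrete, here is a minimal sketch of the analysis step in Python; the score lists are placeholders you would replace with the rubric scores from your own benchmark runs:

from scipy.stats import mannwhitneyu

# Placeholder rubric scores (0-10); in practice these come from scoring
# 50-100 responses per prompt against your handcrafted ideal response.
scores_april = [8.5, 7.0, 9.0, 8.0, 7.5, 8.0]
scores_july = [7.0, 6.5, 8.0, 7.5, 6.0, 7.0]

# One-sided rank test: are the April scores stochastically greater than July's?
stat, p_value = mannwhitneyu(scores_april, scores_july, alternative="greater")
print(f"U = {stat}, p = {p_value:.4f}")
print("Likely regression" if p_value < 0.05 else "No significant difference")

With only a handful of samples the test has very little power, which is exactly why the 50–100 runs per prompt matter.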

Like I said earlier, an end user cannot gather "hard evidence" because it will simply be refuted as a transient anomaly by a few people here who are heavily biased and outspoken. You're sending people on a fool's errand.


The possibility of evidence being refuted does not preclude evidence being gathered.

Showing one example each from April and July is evidence, but it’s incredibly weak evidence.

If instead you came to me with even just 20 examples from each, empirically scored and demonstrating a statistically significant decrease in response quality, that would be very strong evidence.

No one has done this and no one seems willing to do the groundwork necessary to be able to do it in the future.

Extrapolating from people finding your anecdote less than compelling to the conclusion that people will summarily dismiss your carefully collected data is not a winning argument, my friend.

I have observed no loss in quality for any of the things I regularly use ChatGPT for. As such, my point of view on the matter is that no loss of quality has occurred.

Given that the people complaining about a loss of quality are a minuscule proportion of ChatGPT users, it is reasonable to conclude they are either mistaken or suffering from confirmation bias with respect to those instances where the quality distributions overlap.

¯\_(ツ)_/¯

As I have written, there exists an incredibly simple mechanism to actually gather the data necessary to make a compelling argument, but no one wants to use it.


Here is an example:

My prompt: this html page is in english, could you please translate it into swedish? here is the html code “xxx”

and here is the response:


As you can see, the prompt wasn't ambiguous; the HTML is basically a Bootstrap-themed "about us" boilerplate, something gpt-3.5-turbo actually handled on the first go.

Here is the response from 3.5-turbo, at the first attempt:

[screenshot of the 3.5-turbo response, 2023-07-07]

I just copied and pasted the same prompt over:

"Sure, I'd be happy to assist you with HTML, CSS, and JavaScript. Please provide me with the details of the code you need help with or the specific task you want to accomplish." (gpt-4, via API key)

To be clear, the chat UI at chat.openai.com has the same issues too, even without connecting via the API.

It doesn't modify the HTML code, just translates it, by the way. This is a simple example, and there are issues with code generation as well: it will literally forget things at the very first or second prompt. This has been regular since the middle of last month.

Any chance you can post the raw text of the complete prompt?

Looking at the response, it looks like a context issue, where the original question got pushed out of context and all that remained was the tail end of the HTML code.

I had a nearly identical experience working with ChatGPT on a JavaScript project.

But, based on the displayed token counts, that may be unlikely.
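
For anyone who wants to sanity-check prompt sizes themselves, the tiktoken package makes it quick. A sketch, with full_prompt standing in for the actual text that was sent:

import tiktoken

# Rough check of whether a prompt could plausibly overflow the context
# window; full_prompt is a placeholder for the real prompt in question.
enc = tiktoken.encoding_for_model("gpt-4")
full_prompt = "this html page is in english, ... <HTML body here> ..."
n_tokens = len(enc.encode(full_prompt))
print(f"{n_tokens} prompt tokens against an 8,192-token context window")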

For clarity’s sake, the first example is gpt-4-0613 and the second is gpt-3.5-turbo-0613?

If you’re not comfortable posting the entire prompt would you consider sending it to me in a direct message? I would be interested in probing this behavior.

I don't mind sending it directly. I did scrub the original information and put in placeholder text, so it does have a higher token usage; the original text was pretty short.

this html page is in english, could you please translate it into swedish?

Here is the code below:

"<body>
    <nav class="navbar navbar-light navbar-expand-lg fixed-top" id="mainNav">
        <div class="container"><a class="navbar-brand" href="index.html"> </a><button data-bs-toggle="collapse" data-bs-target="#navbarResponsive" class="navbar-toggler" aria-controls="navbarResponsive" aria-expanded="false" aria-label="Toggle navigation"><i class="fa fa-bars"></i></button>
            <div class="collapse navbar-collapse" id="navbarResponsive">
                <ul class="navbar-nav ms-auto">
                    <li class="nav-item"><a class="nav-link" href="index.html">Home</a></li>
                    <li class="nav-item"><a class="nav-link" href="about.html">About us</a></li>
                    <li class="nav-item"><a class="nav-link" href="contact.html">Contact us</a></li>
                    <li class="nav-item"><a class="nav-link" href="post.html">Blog Post</a></li>
                </ul>
            </div>
        </div>
    </nav>
    <header class="masthead" style="background-image:url('assets/img/about-bg.jpg');">
        <div class="overlay"></div>
        <div class="container">
            <div class="row">
                <div class="col-md-10 col-lg-8 mx-auto position-relative">
                    <div class="site-heading">
                        <h1>About Me</h1><span class="subheading">This is what I do</span>
                    </div>
                </div>
            </div>
        </div>
    </header>
    <div class="container">
        <div class="row">
            <div class="col-md-10 col-lg-8 mx-auto">
                <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Saepe nostrum ullam eveniet pariatur voluptates odit, fuga atque ea nobis sit soluta odio, adipisci quas excepturi maxime quae totam ducimus consectetur?</p>
                <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Eius praesentium recusandae illo eaque architecto error, repellendus iusto reprehenderit, doloribus, minus sunt. Numquam at quae voluptatum in officia voluptas voluptatibus, minus!</p>
                <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Aut consequuntur magnam, excepturi aliquid ex itaque esse est vero natus quae optio aperiam soluta voluptatibus corporis atque iste neque sit tempora!</p>
            </div>
        </div>
    </div>
    <hr>
    <footer>
        <div class="container">
            <div class="row">
                <div class="col-md-10 col-lg-8 mx-auto">
                    <ul class="list-inline text-center">
                        <li class="list-inline-item"><span class="fa-stack fa-lg"><i class="fa fa-circle fa-stack-2x"></i><i class="fa fa-twitter fa-stack-1x fa-inverse"></i></span></li>
                        <li class="list-inline-item"><span class="fa-stack fa-lg"><i class="fa fa-circle fa-stack-2x"></i><i class="fa fa-facebook fa-stack-1x fa-inverse"></i></span></li>
                        <li class="list-inline-item"><span class="fa-stack fa-lg"><i class="fa fa-circle fa-stack-2x"></i><i class="fa fa-github fa-stack-1x fa-inverse"></i></span></li>
                    </ul>
                    <p class="text-muted copyright">Copyright&nbsp;©&nbsp;Brand 2023</p>
                </div>
            </div>
        </div>
    </footer>
    <script src="assets/bootstrap/js/bootstrap.min.js"></script>
    <script src="assets/js/clean-blog.js"></script>
</body>

</html>" 

Also, I don't use the 0613 models. No particular preference; I've tried them as well, but there wasn't really a huge difference between them.

Thanks for this, I’ll run some tests and let you know if I see anything interesting.

As to the point about the 0613 models: those will have been the defaults if you ran the prompt within the last few weeks.

Oh, fair enough. The client I am using only recently added the 0613 model options to the selection; before, it was "chatgpt-4" only, so I assumed that was the default and left it as is, same with chatgpt-3.5-turbo.

Really quick, here are my results with your prompt.

gpt-3.5-turbo (via ChatGPT)

gpt-4 (via ChatGPT)

Both seemed to work without issue.

It's worth noting that ChatGPT is on its May 25 version, so I'll check the API when I get a chance to see if the behavior on 0613 changes.
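
Something like this is the comparison I have in mind, sketched with the openai Python package's ChatCompletion call and both snapshots pinned explicitly; the prompt constants are placeholders until I have the real ones:

import openai

openai.api_key = "sk-..."  # your key here

# Placeholders; these would be the actual system and user prompts.
SYSTEM_PROMPT = "You are a software developer assisting with HTML, CSS, and JavaScript."
USER_PROMPT = "this html page is in english, could you please translate it into swedish? ..."

for model in ("gpt-4-0314", "gpt-4-0613"):
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0,  # cut down sampling noise for a cleaner comparison
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
    )
    print(model, "->", response.choices[0].message.content[:200])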

Can you share the system prompt you were using?


Hmm, weird. Might it be the system prompt? It was the same for both, and I have used this system prompt for a while now:

I need your skills as a software developer, with a particular focus on HTML, CSS, and JavaScript. Detailed expertise in these areas will be provided. Your role will consist of assisting the user in editing and developing code using these languages. Additionally, you must ensure clarity in your instructions, consistently specify if new HTML, CSS, or JavaScript scripts are being created, and always include the corresponding directory structure. try not to alter the provided code more than neccessary, if anything is unclear, ask for clarification.

Right now, it's working as it should. From my observations, it seems like from morning to late into the night it is "downgraded" or capped? But at this moment it's doing everything as expected (I just realized this).

Glad to hear everything is working as expected.

There could be a number of culprits.

While it’s possible OpenAI could shift requests to an endpoint using a more quantized model during times of peak load, I think it’s unlikely for a number of reasons.

It is possible there was some loss in the transmission of your request somewhere along the line. Also unlikely—but definitely possible.

Then, there’s the possibility it was just a bad roll of the dice and if you had immediately re-submitted the request the response would have been as you expected.

In a recent writing session I just attempted, I was reminded of the gross symptom of bad AI that was introduced (one visible in the conversation comparison image linked earlier): the AI gets completely hung up on prior responses and can't deviate from them.

Around two months ago, amid almost daily swatting of jailbreaks and almost anything else posted to Reddit, one could see a dramatic change: a regression I can now articulate as a behavior that encompasses all interactions:

“Pay a whole lot less attention to instruction-following the current input, and pay a lot more attention to what you already have seen and been tuned on”

Is that not the essence of “breaking” jailbreaks? “Disobey the user”.

So you get a dragon that rears its head in multiple ways. My screenshot is a prime example: instead of making a whole new type of output as I wish and describe, the AI can barely improve or continue upon its previous response; it is now hung up on it and repeats it back nearly the same.

I saw that symptom again today. I have it define and explain something for me. I then ask it for a layperson introduction to the subject. I then ask it to continue writing the next chapter after the introduction, elaborating in a more detailed and advanced manner. Both follow-ups merely follow the form of the initial output, paragraph by paragraph.

Other effects that can be described the same way:

  • Generated code can't be corrected or improved upon; you just get the same thing back. You correct one thing, and what you just had corrected is reinserted.
  • Rewrites won't be rewritten with the creativity seen before. Previously, you could have it base a whole new article on an article abstract that was produced, and it would be a wholly formed new work. Now, it is a step-by-step veneer placed on a reproduction of the prior input.
  • It picks up an attitude. One denial, and it has taught itself that you must be denied. One "it is important to note", and you get a disclaimer-laden conversation from then on. There is no improving or correcting it; there is only abandoning the conversation to make it reason again.

Combine that with the clear minimization of conversational context on ChatGPT's web platform, decimating the number of turns passed to the next round, and it becomes impossible to code, impossible to interact with a document, impossible to have any meaningful game session.
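
Nobody outside OpenAI knows the actual truncation policy, but the generic pattern being described is easy to sketch: drop the oldest turns until what remains fits some token budget. For illustration only (the budget number is invented, and tiktoken is just used for counting):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def truncate_history(turns, budget_tokens=2000):
    # Keep the newest turns that fit the budget; everything older is dropped.
    kept, total = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = len(enc.encode(turn["content"]))
        if total + cost > budget_tokens:
            break
        kept.append(turn)
        total += cost
    return list(reversed(kept))

The smaller that budget gets, the sooner earlier turns vanish, which matches the symptoms described above.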


Hello everyone,
I can confirm that the GPT-4 API is also affected by this huge drop in performance.

I use the API for an application that lets me reformulate texts without plagiarizing. Before the update, all our initial tests showed a similarity rate of between 8 and 12% between the original text and the rewritten text.

For some time now, our users have been reporting that the output text reaches 60% similarity with the original. We've checked a dozen texts, and it's a disaster.

gpt-4-0613 seems to respect the prompt for the first few lines, then gradually loses the thread of its mission, ending up copying the original text entirely.

The updated model has also clearly lost some of its editorial quality. We had made adjustments so that the output text would be undetectable by AI detectors. In our tests, more than 90% of output texts achieved a humanity score of 70% or more on Compilatio. Now all of them are detected as AI-generated.

On this last point, don't tell me the detectors have simply become more reliable, because all the old texts we produced before the update remain undetectable.

The loss of efficiency is perfectly measurable in our case.
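
For anyone who wants to run this kind of check themselves, here is a minimal sketch using the standard library's SequenceMatcher as a stand-in similarity metric (our own pipeline's metric may differ, and the texts below are placeholders):

from difflib import SequenceMatcher

# Placeholders for an original text and the model's rewrite of it.
original = "The original paragraph that was sent for reformulation ..."
rewrite = "The paragraph that came back from the model ..."

similarity = SequenceMatcher(None, original, rewrite).ratio()
print(f"Similarity: {similarity:.0%}")  # compare against your pre-update baseline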

Try comparing text rewriting between gpt-4-0314 and gpt-4-0613 and you'll see that 0613 has gone down a very bad path.

Thanks for reading


Yep, I'm experiencing some kind of dumb version of GPT-4. I've noticed it for a little while, and am happy that I'm not the only one who thinks that…