Does anyone have any real proof that there’s been a degradation in GPT-4’s performance?

I’ve been using GPT-4 since launch and I honestly haven’t noticed much.
But I have also seen a lot of comments about this all over the internet, and I really haven’t been able to find much in the way of actual proof. A lot of claims, but when I try to replicate the problem the results don’t seem too bad to me.

I would, however, also really like to know if there has been a degradation in quality over the last 7 weeks. Does anyone have any actual proof? Side-by-side comparisons, examples of it making mistakes, etc.

And if you are using the API, can you please share what model parameters you are using? Temperature, top-p, frequency penalty, etc.

Again, this isn’t meant to be me being hostile towards these claims. I just have not been able to find a real, concrete example of this.

Thanks in advance.


Yes, here it is…

For me, I find that the speed increases and the level of usable, reasoned response goes down. The current Plus model, for me, appears to be running predominantly on GPT-3.5. GPT-3.5 isn’t good enough for what I want to do. It sends me down too many rabbit holes, and it doesn’t seem to have the same memory.

For me it has been the opposite; it has only gotten better. None of my experiments showcases degradation in GPT-4’s performance. But you can check: just use the older models to test out any eval you wish. The API still exposes older model snapshots from before the updates, and to be fair, I find it hard to believe OpenAI would release a worse version of it.
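If you want to check it yourself, something like this is enough as a minimal sketch (assuming the pre-1.0 openai Python package, and that dated snapshots such as gpt-4-0314 and gpt-4-0613 are available on your account; the prompt is just a placeholder):

```python
# Minimal side-by-side check: same prompt, older snapshot vs. newer snapshot.
# Assumes the pre-1.0 `openai` package and access to both dated models.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

PROMPT = "Summarise the trade-offs between quicksort and mergesort in three bullet points."

for model in ("gpt-4-0314", "gpt-4-0613"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # keep runs as repeatable as possible
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```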

Yeah, 3.5 is to be used with the 16k variant when 4 can’t handle the size of the context.
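Roughly like this, as a sketch (assuming tiktoken for counting and the standard 8k gpt-4 context window; the cut-off numbers are approximate):

```python
# Rough sketch: fall back to gpt-3.5-turbo-16k only when the prompt is too
# large for gpt-4's 8k window. Token counts here are approximate; the exact
# per-message overhead differs slightly by model.
import tiktoken

def pick_model(prompt: str, reply_budget: int = 1000) -> str:
    enc = tiktoken.encoding_for_model("gpt-4")
    prompt_tokens = len(enc.encode(prompt))
    if prompt_tokens + reply_budget <= 8192:
        return "gpt-4"
    return "gpt-3.5-turbo-16k"
```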

I just have a feeling; I don’t have any hard measurements. What measurements of performance do you use?

I think without some kind of comparison to previous responses, it’s hard to make any conclusions on this.

Because, from day one, I have noticed mistakes like that from GPT-4. It’s not perfect by any means, and this could just be explained by you noticing more mistakes as you use it more.

And beyond that, I think its performance with plugins is a separate beast. There’s a lot more that’s different there, and it’s difficult to find a direct comparison from before any perceived change.

I actually think they are using some kind of agent system behind the scenes to interact with the plugins, which I believe is GPT-3.

A really simple way would just be to look for old responses in your chat history and look for anything you find to be impressive. Try the same prompt again with GPT currently, do it at least a few times, and post the results.

I could even whip up a quick Python program to do this testing automatically with the API. It could even run through multiple model parameters to see how much that changes anything.
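Something along these lines, as a rough sketch (assuming the pre-1.0 openai Python package; the prompt and the parameter grid are just placeholders):

```python
# Rough sketch of the comparison harness: one prompt, several parameter
# combinations, several runs each, results dumped to stdout for eyeballing.
import itertools
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

PROMPT = "Paste an old prompt that used to give an impressive answer here."
TEMPERATURES = [0.0, 0.7, 1.0]
TOP_PS = [0.5, 1.0]
RUNS_PER_SETTING = 3

for temperature, top_p in itertools.product(TEMPERATURES, TOP_PS):
    for run in range(RUNS_PER_SETTING):
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=temperature,
            top_p=top_p,
        )
        print(f"temp={temperature} top_p={top_p} run={run + 1}")
        print(resp.choices[0].message.content)
        print("-" * 60)
```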

I record my AI conversations

I don’t have time right now to watch the whole thing. Could you describe more what you think the problem is in your video?

Because, honestly, it’s never been perfect and it’s never really been able to do math.

I have 10+ team members across 3 GPT Plus accounts. Each one has a story to tell about how wonderful GPT used to be…

  1. It forgets the context easily; losing things just 4 chats back is the biggest issue.
  2. With plugins it is much worse.
  3. Its amazing coding ability is now at about 50%.

If no one from the GPT team really listens, there’s no point in ranting…

The 3.5 version in the ChatGPT app appears to be running on Davinci 002. In the app, select GPT-4 and then switch back to GPT-3.5, then look at the address field: it says “text-davinci-002”. If they do this with GPT-3.5, then probably they are doing something to GPT-4 too.

It’s not as good. I use it daily; I use it a lot to write JS functions faster. Previously, I could ask it to do something like nest data in state objects, and it understood well and performed. Try asking it to nest data in a state object today, and it’s not usable. It honestly wastes more time trying to get it right than it would take to just write the functions myself. I’m disappointed. I don’t see any real difference between the API and the GPT-4 UI.

Recently as well I’ve noticed that it forgets context very quickly. Things as simple as using a different library are forgotten within a couple of messages.

Pure speculation: they are a bit more aggressive in their truncation techniques, possibly to accommodate the ridiculous number of tokens that plugin schemas require.
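If that’s what’s happening, a naive version would look something like the sketch below (pure guesswork on my part; the 4-characters-per-token figure is a deliberately crude estimate, not how OpenAI actually counts or truncates anything):

```python
# Illustrative only: a naive "drop the oldest messages" truncation of the
# kind I'm speculating about. Walks backwards so the newest messages survive.
def truncate_history(messages: list, budget_tokens: int = 6000) -> list:
    def rough_tokens(msg: dict) -> int:
        # Crude estimate: ~4 characters per token.
        return max(1, len(msg["content"]) // 4)

    kept = []
    total = 0
    for msg in reversed(messages):
        cost = rough_tokens(msg)
        if total + cost > budget_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))
```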

Although it’s been much more consistent in the code it outputs, it’s now noticeably worse in memory/context.

I don’t understand it. They had two great branches: Instruct models and Chat models.

Now the chat model seems to be more suited for single-shot instructions (such as function calling or content generation). For me, DaVinci still reigns king for any fun, creative conversation. But I’m only speaking about ChatGPT & not the API models.

We need a framework of prompts to test with and results, otherwise this discussion ends up meaningless.

I noticed that ChatGPT with GPT-4 forgets context quickly, as others in this thread have mentioned.
I used to paste larger documents in two parts and prompt it to wait with the task until I had entered both parts, but now it seems to forget the entire conversation after I’ve pasted the second part.
Apart from that it works fine.
But yes, forgetting the context is an actual issue I’m noticing, but I only tested that in the ChatGPT version.

As for the plugins:
GPT-4 with Bing is a work in progress and a big mess.
Since I’m working on something very similar, I think the issue lies within the prompts (if you ever get it to find and actually read anything at all, first): you get a huge amount of data and need to balance the prompts well in order to achieve good results. And frankly, I never tried it with Bing; in my own tool I use Google APIs.


I used to be able to paste something like this, and GPT-4 would instantly find the problem. Now, it didn’t even recognize where the problem is, and it’s just a syntax error with an extra }:

db.attributes.updateMany({"terms.0": { $exists: true}}}, { $unset: {"terms.$[].attributeId": 1}})

clone(t={}){const r=t.loc||{};return e({loc:new Position("line"in r?r.line:this.loc.line,"column"in r?r.column:...<omitted>...)} could not be cloned.

GPT-4 responded with “The MongoDB query you’ve written looks syntactically correct for…” with a whole load of information, none of which helps me.

It’s definitely not syntactically correct. The problem here is that OpenAI gets paid per token via the API, regardless of the quality of the response. So the incentive is to increase throughput of tokens, not quality of response. That’s the issue. Ever since GPT-4 became much quicker, the quality has dropped substantially.

Originally, GPT-4 would never overlook such a basic syntax mistake. I am working on in-house Llama-2 fine-tuning for the majority of my AI workloads, with only limited API calls to OpenAI. With the performance degradation and such restrictive content moderation, OpenAI’s GPT is simply losing its appeal for me.


I would say it is common sense to place your resources where they are most profitable. All resources being limited, there will always be a phenomenon of technological inflation. It seems clear to me that the real money to be made is not in personal uses of ChatGPT or the API, but elsewhere.

On performance issues, the literature shows that:

  1. bigger is not always better
  2. training on AI-generated data degrades models
  3. false and biased data remain present in the training corpus
  4. human reinforcement causes long-term global degradation

I’m certainly curious about the real capacity and performance of GPT-4’s evolution with maximum resources, vector memory, and multi-agent systems.

Hope the performance gap with open-source LLMs will continue to decrease.

And there is the cost question… even more so when you pay for a lot of irrelevant answers.

I agree. Last week I was able to paste over 500 lines of Python code and GPT would go through it, examine it, and help me wherever it was needed. This week it became super dumb. It fails at a very basic level. So sad, because it used to be a great tool, very helpful. Today it’s a joke :-/

I’m curious how much of GPT’s training data is comprised of AI-generated content, which does degrade performance. When you can’t differentiate the two properly, your training data gets contaminated. On top of that, it’s been shown that training for “safety” (aka censorship) actually degrades the performance and reasoning of the model, and that as GPT-4 was trained more for so-called safety, its reasoning skills overall have declined.