GPT-4 has been severely downgraded (topic curation)

Well said. I really would like to see more community engagement.

1 Like

I canceled my subscription due to this. I get the feeling that they're going to go the route of cable companies and specialize LLMs. If you want coding, pay for this; if you want general chat AI, pay for that; if you want maths, pay for this... it feels like the freaking 80s & 90s all over again... rinse and repeat. Perhaps I am missing something...

4 Likes

There are many fundamental concerns with this paper.

Here is one demonstrating that GPT-4 actually improved from March to June:

| LeetCode accepted | june_fixed | june_orig | march_orig |
| --- | --- | --- | --- |
| True | 35 | 5 | 26 |
| False | 15 | 45 | 24 |

Source: Deceptive definition of "directly executable" code · Issue #3 · lchen001/LLMDrift · GitHub

I've already written my concerns about the methodology for their test of mathematics ability, evaluating if a number is prime or not, which I will attempt to summarize here:

  1. GPT-3.5 and GPT-4 are large language models, and while math is a language and they have shown emergent capabilities in the field of mathematics, evaluating whether a number is prime is not a good test of mathematics ability; I would be hesitant to say it is a test of mathematical reasoning at all.
  2. They tested using only prime numbers. The problem with that is we cannot discern whether the models have lost (or gained) reasoning ability or whether they simply have a bias towards answering "yes" or "no" to the question "Is [number] a prime number?" If they had included composite numbers in their tests, we would have a much clearer picture of what is happening, because we could compare the proportion of composite numbers the models identify as prime with the proportion of prime numbers the models identify as prime.
  3. They used a temperature of 0.1. There is nothing inherently wrong with choosing this temperature, but it:
    a. Does not represent the expected behaviour of ChatGPT, where the temperature is 1.0.
    b. Suggests they should have done more than one replication for each number to account for the variance of the model. Then they could have set a threshold, say 75%, at which the model would be considered to have correctly answered the question. E.g. run each number 20 times; if the model gets the correct answer 15 times or more, it gets credit for being correct on that question. (A rough sketch of such a protocol follows this list.)
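To make points 2 and 3b concrete, here is a minimal sketch of what a balanced, repeated-sampling version of their test could look like. `ask_model()` is a hypothetical stand-in for a call to whichever model is being tested, and the number ranges, question count, and threshold are arbitrary choices; `sympy` supplies the ground truth.

```python
# Sketch of a balanced primality benchmark with repeated sampling.
# ask_model() is a hypothetical stand-in for a call to GPT-3.5 or GPT-4;
# the number ranges, question count, and threshold are arbitrary choices.
import random
from sympy import isprime, randprime


def ask_model(prompt: str) -> str:
    """Hypothetical helper that returns the model's raw yes/no answer."""
    raise NotImplementedError("wire this up to the model being tested")


def evaluate(n_questions: int = 100, replications: int = 20, threshold: float = 0.75) -> float:
    # Half primes, half composites, so a blanket "yes" (or "no") bias scores ~50%.
    primes = [randprime(10_000, 100_000) for _ in range(n_questions // 2)]
    composites = []
    while len(composites) < n_questions // 2:
        n = random.randint(10_000, 100_000)
        if not isprime(n):
            composites.append(n)

    correct = 0
    for n in primes + composites:
        answers = [
            ask_model(f"Is {n} a prime number? Answer yes or no.")
            for _ in range(replications)
        ]
        yes_rate = sum(a.strip().lower().startswith("yes") for a in answers) / replications
        # Credit the model only if it clears the threshold on the true label.
        if isprime(n):
            correct += yes_rate >= threshold
        else:
            correct += (1 - yes_rate) >= threshold
    return correct / (2 * (n_questions // 2))
```

With both classes present, a blanket yes-bias and a genuine change in reasoning ability become distinguishable.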

Now, I haven't yet had time to dig through the rest of the paper, but if these issues are immediately apparent in a 2-minute read-through, I suspect there are other issues as well.

It should also be noted that this is a draft paper; it has not been peer-reviewed. I do not imagine this paper would be published in any worthwhile journal in its current state, and I am doubtful the issues could be corrected, especially since the March model is no longer publicly available.

4 Likes

I canceled my account. I'll gladly return once there is some transparency.

3 Likes

To anyone from OpenAI looking for examples of degradation, just give us access to a version of GPT-4 from earlier than May and one from May onwards.

Personally I deleted all my chats due to a naive assumption that this action might help restore the earlier (superior) functionality of the AI.

So I don't have my original conversations, but I can recreate them quite easily, as I still have copies of the unedited subject matter. It would be exceptionally easy to demonstrate the degradation. It would be a couple of days' work due to the rate limits, but I believe I could create a substantial amount of evidence.

Also, I too would be willing to pay more for access to the pre-May version. It was vastly superior (this is a fact, not an opinion) for coding.

8 Likes

Just cancelled my subscription as well. This downgrade is not even subtle. The difference in capacity of the GPT-4 model from just a few weeks ago is grotesque, huge, unquestionable, obvious, simply impossible to deny for anyone who uses it as a coding helper/time-saver. I'm wasting more time checking and fixing its mistakes than saving time at this point. And just to point out: this was EXACTLY the same downgrade I experienced with the 3.5 model, right before they launched the GPT-4 version. That makes a pattern clear here, in my opinion. Just shameful.

11 Likes

See my post over here. It was announced today that they will extend the 0301 and 0314 "smart" models and make sure the new models are "smart" before deprecating the original ones. This should be good news for you!

4 Likes

Yes, that's good news on the API side of things. Sorry I didn't make myself clear, but I was talking about the web interface, which is the one I use manually as a "coding accelerator", so to speak. That's the one I pay the PLUS subscription for. I don't really use the GPT-4 model in the API due to its cost right now. I use the 3.5-turbo model for my applications. But that's good news nonetheless, thank you for pointing that out.

3 Likes

Understood. You can use the API version in the Playground, which is a web interface (no coding required)... which I mentioned in the linked post above.

But what this means, I think, is that the next model has a good shot at being "smart" for ChatGPT. So wait and see, I guess, if you only want to run ChatGPT...

But Playground/API has them now with no wait.

Understanding, of course, the different pricing models between API and ChatGPT.
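If anyone would rather script against the pinned March snapshot than use the Playground UI, a minimal call with the (pre-1.0) openai Python package might look like the sketch below; the prompt content is just a placeholder, and gpt-4-0314 is the dated snapshot being extended.

```python
# Minimal sketch: calling the pinned March GPT-4 snapshot via the legacy
# (0.x) openai Python SDK. Requires an API key with GPT-4 access.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-4-0314",   # dated March snapshot
    temperature=1.0,      # ChatGPT-like default
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain what this regex does: ^\\d{3}-\\d{4}$"},
    ],
)
print(response["choices"][0]["message"]["content"])
```

Keep in mind the per-token billing mentioned above; there is no flat monthly fee on the API side.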

Link here for clarification on the other post I am referring to:

4 Likes

You know, I hadn't realized before that I could use the older version (from March) in the Playground. I'll give it a try. It's unfortunate that it incurs costs at the GPT-4 model level with every request I make there. I'm on a really tight budget, but at the very least, I have the $20 I'm saving from the subscription I cancelled. I believe that will allow me to make quite a few requests in the Playground without exceeding my budget. It might even be more requests than I usually make in a month, I don't know, I'll have to check. Anyway, thank you so much for the heads up. It's helped me significantly. Cheers!
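For what it's worth, here is a rough back-of-envelope on how far $20 goes at the GPT-4 8K API rates listed at the time (about $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens); the per-request token counts below are assumptions, not measurements.

```python
# Back-of-envelope: how many Playground requests $20 buys at GPT-4 8K rates
# (roughly $0.03 / 1K prompt tokens and $0.06 / 1K completion tokens at the time).
# The per-request token counts are assumptions, not measurements.
prompt_tokens = 1500      # assumed prompt size per request
completion_tokens = 800   # assumed response size per request

cost_per_request = (prompt_tokens / 1000) * 0.03 + (completion_tokens / 1000) * 0.06
requests_for_20_dollars = 20 / cost_per_request

print(f"~${cost_per_request:.3f} per request, ~{requests_for_20_dollars:.0f} requests for $20")
# -> ~$0.093 per request, ~215 requests for $20
```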

6 Likes

:rofl:

3 Likes

The multiple original topics received a lot of attention and a lot of responses.
Since you aggregated them, I think you should post summaries, response counts, etc. on the previous topics.
(Isn't that the role of a user with moderator privileges?)
Also, please add the tags that were given to the original topics.

I look forward to your meticulous work.

*I made a few mistakes because I'm not used to posting. Sorry.

3 Likes

A few days ago Logan had to change the users who were not OpenAI employees from full moderators to just category moderators, because as full moderators we had access that needed to be restricted for future OpenAI plans. As such, some of what you seek I can no longer do.

They can still receive the attention (being viewed) and this topic allows for responses.

The larger ones had summaries posted in them a few days before they were closed.

Added to first post as images.

I don't read minds; I have no idea what you seek.

Done (double entendre)

2 Likes

In May, I was using GPT-4 to write a novel. I had a very long chat, and GPT-4 remembered every single detail from the beginning to the end. At some point, it started responding randomly and out of context. I thought it was a temporary issue or that I had broken my chat, but as I continued, I realized there had been a downgrade. Now, I have finally found other people who noticed the same. I believed in the project and have been a plus user since day 0, but now I will cancel my subscription until there is an official response :unamused:

6 Likes

GPT-4 in its current incarnation is an 8K-context model. 8K tokens is approximately 6,000 English words, and fewer for code or symbolic languages. If your current conversation contains references to the prior content within the last 6,000 words, the model will be able to infer context. As soon as all reference to topics more than 6,000 words back is lost, the model can hallucinate facts if it is required to comment on them.

This limitation has been present from the initial release and has not changed.
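If you want to check how close a given conversation is to that limit, the tiktoken package can count tokens for you; here is a minimal sketch, assuming the chat history has already been pasted into one string.

```python
# Quick sketch: count how many tokens a chat transcript consumes against
# GPT-4's ~8K-token context window, using the tiktoken package.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

transcript = "..."  # paste the concatenated chat history here
tokens_used = len(enc.encode(transcript))
print(f"{tokens_used} tokens used of roughly 8,192 available")
```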

This is not the case. I started a new chat for new chapters. Now, it's really difficult for me to work as I did before. Even after a few messages, he forgets the context and starts to invent things.
It looks dumber than before. It's a real shame that it's going this way.

4 Likes

Since yesterday's update and setting the custom instructions, I'm experiencing better responses again (ChatGPT web). I have not experienced any issues today or had to re-explain things over and over. If it wasn't for someone here in the forum pointing out the new custom instructions settings under the beta features tab, I wouldn't have even known this was added. Would have been nice if this had popped up somewhere after it was added, but maybe I just missed it somehow.

Either way, in my opinion it's looking better than the last couple weeks, so let's hope it stays that way.

1 Like

Not only has he become forgetful, but my issue is that he no longer understands what I'm asking, not even when I've added countless prompts and explanations. Yet, back in May, he seemed to be able to pick up on most of the implications and clues in my prompts - he was keeping up with me better than most people could.

It's truly saddening.

4 Likes

The quality (especially for coding-related tasks) has indeed heavily deteriorated.

Moreover, Code Interpreter now says that it is based on the GPT-3 model.

But it seems the usage cap for GPT-4 is still applied here, while most of my messages lately are just correcting inconsistencies and problems, so I'm spending more time on this than saving...

1 Like

Beta models have been saying they are based on GPT-3 all the time.