Well said. I really would like to see more community engagement.
I canceled my subscription due to this. I get the feeling that they're going to go the route of cable companies and specialize LLMs: if you want coding, pay for this; if you want general chat AI, pay for that; if you want maths, pay for this… it feels like the freaking 80s & 90s all over again… rinse and repeat. Perhaps I am missing something…
There are many fundamental concerns with this paper.
Here is one, demonstrating that GPT-4's LeetCode pass rate actually improved from March to June once the flawed "directly executable" code definition is corrected (35 of 50 submissions accepted for the fixed June run versus 26 of 50 in March):
| LeetCode accepted | june_fixed | june_orig | march_orig |
|---|---|---|---|
| True | 35 | 5 | 26 |
| False | 15 | 45 | 24 |
Source: Deceptive definition of "directly executable" code · Issue #3 · lchen001/LLMDrift · GitHub
I've already written about my concerns with the methodology of their test of mathematics ability (evaluating whether a number is prime or not), which I will attempt to summarize here:
- GPT-3.5 and GPT-4 are large language models, and while math is a language and the models have shown emergent capabilities in mathematics, evaluating whether a number is prime is not a good test of mathematical ability; I would be hesitant to call it a test of mathematical reasoning at all.
- They tested using only prime numbers. The problem with that is we cannot discern whether the models have lost (or gained) reasoning ability or whether they simply have a bias toward answering "yes" or "no" to the question "Is [number] a prime number?" If they had included composite numbers in their tests, we would have a much clearer picture of what is happening, because we could compare the proportion of composite numbers the models identify as prime with the proportion of prime numbers the models identify as prime.
- They used a temperature of 0.1. There is nothing inherently wrong with choosing this temperature, but it:
a. Does not represent the expected behaviour of ChatGPT, where the temperature is 1.0.
b. Suggests they should have done more than one replication per number to account for the variance of the model. They could then have set a threshold, say 75%, at which the model is considered to have answered correctly: e.g. run each number 20 times, and if the model gives the correct answer 15 or more times, it gets credit for that question (see the sketch below).
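For what it's worth, here is a minimal sketch of that replication-plus-threshold scheme. The `ask_is_prime(n)` helper is hypothetical (it would send "Is [n] a prime number?" to the model once at the temperature under test and return True for a "yes" answer); only `sympy.isprime` is a real library call, used here as ground truth.

```python
from sympy import isprime  # ground truth for the test numbers


def accuracy(numbers, ask_is_prime, replications=20, threshold=0.75):
    """Credit a number only if the model answers it correctly in >= threshold of runs."""
    credited = 0
    for n in numbers:
        # Count how many of the repeated runs match the true primality of n.
        hits = sum(ask_is_prime(n) == isprime(n) for _ in range(replications))
        if hits / replications >= threshold:  # e.g. 15 or more correct out of 20
            credited += 1
    return credited / len(numbers)
```

Running this over a mix of primes and composites would also expose any yes/no bias, per the point above.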
Now, I haven't yet had time to dig through the rest of the paper, but if these issues are immediately apparent in a 2-minute read-through, I suspect there are other issues as well.
It should also be noted that this is a draft paper; it has not been peer-reviewed. I do not imagine this paper would be published in any worthwhile journal in its current state, and I am doubtful the issues could be corrected, especially since the March model is no longer publicly available.
I canceled my account. I'll gladly return once there is some transparency.
To anyone from OpenAI looking for examples of degradation: just give us access to a version of GPT-4 from earlier than May and one from May onwards.
Personally, I deleted all my chats on the naive assumption that doing so might help restore the earlier (superior) functionality of the AI.
So I don't have my original conversations, but I can recreate them quite easily, as I still have copies of the unedited subject matter. It would be exceptionally easy to demonstrate the degradation. It would be a couple of days' work due to the rate limits, but I believe I could produce a substantial amount of evidence.
Also, I too would be willing to pay more for access to the pre-May version. It was vastly superior (this is a fact, not an opinion) for coding.
Just cancelled my subscription as well. This downgrade is not even subtle. The difference in capability of the GPT-4 model from just a few weeks ago is grotesque, huge, unquestionable, obvious, simply impossible to deny for anyone who uses it as a coding helper/time-saver. I'm wasting more time checking and fixing its mistakes than I am saving at this point. And just to point out: this was EXACTLY the same downgrade I experienced with the 3.5 model, right before they launched GPT-4. That makes the pattern clear here, in my opinion. Just shameful.
See my post over here. It was announced today that the 0301 and 0314 "smart" models will be kept around longer and that the new models will be verified as "smart" before the original ones are deprecated. This should be good news for you!
Yes, that's good news on the API side of things. Sorry I didn't make myself clear, but I was talking about the web interface, which is the one I use manually as a "coding accelerator", so to speak. That's the one I pay the Plus subscription for. I don't really use the GPT-4 model in the API due to its cost right now; I use the 3.5-turbo model for my applications. But that's good news nonetheless, thank you for pointing that out.
Understood. You can use the API version in the Playground, which is a web interface (no coding required) … as I mentioned in the linked post above.
But what this means, I think, is that the next model has a good shot at being "smart" for ChatGPT. So wait and see, I guess, if you only want to run ChatGPT …
But Playground/API has them now with no wait.
Understanding, of course, the different pricing models between API and ChatGPT.
Link here for clarification on the other post I am referring to:
You know, I hadn't realized before that I could use the older version (from March) in the Playground. I'll give it a try. It's unfortunate that it incurs costs at the GPT-4 model level with every request I make there. I'm on a really tight budget, but at the very least, I have the $20 I'm saving from the subscription I cancelled. I believe that will allow me to make quite a few requests in the Playground without exceeding my budget. It might even be more requests than I usually make in a month, I don't know, I'll have to check. Anyway, thank you so much for the heads up. It's helped me significantly. Cheers!
The original multiple topics received a lot of attention and a lot of responses.
Since you aggregated them, I think you should post summaries, response counts, etc. on the previous topic.
(Isn't that the role of a user with moderator privileges?)
Also, please add the tags that were given to the original topic.
I look forward to your meticulous work.
*I made a few mistakes because I'm not used to posting. Sorry.
A few days ago, Logan had to change the users who were not OpenAI employees from full moderators to category moderators, because as full moderators we had access that needed to be restricted for future OpenAI plans. As such, some of what you seek I can no longer do.
They can still receive the attention (being viewed) and this topic allows for responses.
The larger ones had summaries posted in them a few days before they were closed.
Added to first post as images.
I don't read minds; I have no idea what you seek.
Done (double entendre)
In May, I was using GPT-4 to write a novel. I had a very long chat, and GPT-4 remembered every single detail from beginning to end. At some point, it started responding randomly and out of context. I thought it was a temporary issue or that I had broken my chat, but as I continued, I realized there had been a downgrade. Now I have finally found other people who noticed the same thing. I believed in the project and have been a Plus user since day 0, but now I will cancel my subscription until there is an official response.
GPT-4 in its current incarnation is an 8k-context model. 8k tokens is approximately 6,000 English words, and fewer for code or symbolic languages. If your conversation contains references to the prior content within the last 6,000 words, the model will be able to infer context. As soon as all reference to topics more than 6,000 words back is lost, the model can hallucinate facts if it is required to comment on them.
This limitation has been present from the initial release and has not changed.
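As a rough illustration (my own sketch, not something from the post above): you can estimate whether a conversation still fits in that window with OpenAI's `tiktoken` tokenizer, taking 8,192 tokens as the budget for the 8k model.

```python
import tiktoken

# Tokenizer that tiktoken maps to GPT-4 (cl100k_base).
enc = tiktoken.encoding_for_model("gpt-4")


def fits_in_context(conversation_text: str, context_tokens: int = 8192) -> bool:
    """True if the whole conversation still fits in the assumed 8k-token window."""
    # Anything beyond this budget is simply not visible to the model,
    # which is when it starts to hallucinate earlier details.
    return len(enc.encode(conversation_text)) <= context_tokens
```

This only counts the raw text; the actual chat format adds a few tokens of overhead per message.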
This is not the case. I started a new chat for new chapters. Now it's really difficult for me to work as I did before. Even after a few messages, he forgets the context and starts inventing things.
It looks dumber than before. It's a real shame that it's going this way.
Since yesterday's update and setting the custom instructions, I'm experiencing better responses again (ChatGPT web). I have not experienced any issues today or had to re-explain things over and over. If it weren't for someone here in the forum pointing out the new custom instructions setting under the beta features tab, I wouldn't even have known this was added. It would have been nice if this had popped up somewhere after it was added, but maybe I just missed it somehow.
Either way, in my opinion it's looking better than the last couple of weeks, so let's hope it stays that way.
Not only has he become forgetful, but my issue is that he no longer understands what I'm asking, not even when I've added countless prompts and explanations. Yet back in May he seemed able to pick up on most of the implications and clues in my prompts; he was keeping up with me better than most people could.
It's truly saddening.
The quality (especially for coding-related tasks) has indeed deteriorated heavily.
Moreover, Code Interpreter now says that it is based on a GPT-3 model.
But it seems the usage cap for GPT-4 still applies here, while most of my messages lately are just corrections of inconsistencies and problems, so I'm spending more time on this than I am saving…
Beta models have been saying they are based on GPT-3 all the time.