They probably made AI “too-safe”? Idk.
GPT-3.5 Turbo used to be able to play trivia games and remember my name for at least six prompts. Now it’s like it has no memory after one or two prompts.
Hello William, welcome to the community.
For me, the model just refused to play any game. xD
I’ve been talking about it on Discord for a few weeks now, and I’ve even sent several pieces of feedback to OpenAI, but the response time is extremely long. It is absolutely visible, especially in code, that GPT-4 is no longer as capable as it was a month or more ago. Its ability to pick up previous conversations is limited, and responses arrive much faster than before; as has been said, they seem to favor quantity over quality. For the moment I’ve canceled all the subscriptions I had, which is all there is to do, I think…
IF they did switch the model and keep the name, it’s clearly fraud, but no one will be able to prove it, since GPT-* models are based on random sampling.
Agreed. I see lots of complaints, and I can’t argue with sample size, but I personally do not see a degradation in the chat UI, the API, or the Playground. In fact, I’ve been doing a lot of Rust programming lately and have noticed that later model versions give me much better Rust code, compiling 80% of the time or more, vs. 10–20% of the time a few weeks ago.
I also give it complex creative tasks, like distilling brand profiles, social gossip, news, and research into creative campaign ideas, and the responses seem subjectively on par with or better than prior models.
Just my own experience…
Yes! I’m glad I’m not the only one who has noticed this. GPT-4 is still miles above GPT-3.5 Turbo with respect to code, but it definitely appears less capable now than just a couple of weeks ago.
Now, when I feed it code to analyze, it sometimes freaks out and starts browsing the web for God knows what, coming back with answers totally unrelated to what it was asked.
It also has developed a nasty habit of “hiccups”, where it will start writing code, then stop in the middle, say “I apologize.” and then start writing again from the beginning.
We’re not talking huge amounts of code.
And, as for people asking for prompt “examples”, it’s not that simple. You work with these things for several hours a day, every day, and you begin to get a sense of them, how they react to certain circumstances. It becomes very obvious when their behavior changes dramatically. You know, like humans.
Hate to repeat myself, but YES. Exactly! Ditto. I am working with PHP.
However, I will say that the Codex model has proven a bit more reliable with code work than the Browsing model, which I initially used in the hope that it could verify its responses. I don’t know what other users’ experience is with the “Powered by Bing” GPT-4, but in my experience it has a 50–70% failure rate.
GPT-4 is going south now, from the “amazing” category to just “OK”.
After the May 12th release (it’s now May 24th), something really went wrong.
I’m using the Noteable plugin (with the browsing plugin it is even worse).
Before: I asked it to plot a data set. After correcting it, something was still wrong; why is the y-axis value so low while the total ccf is in the 50k range?
Another one: I asked, “Do you see any anomalies?” The response missed the issues, even after correcting.
Earlier versions of GPT-4 would find all the issues in one stroke. Now that’s gone; even after prompting, I have to push it toward specifics for it to find anything.
Another problem I’ve been facing, in addition to those mentioned here, is that now it tends to create a function signature and put comments like “implement the necessary logic here to do {what I asked}” instead of the code.
That is, it seems to be too lazy to implement the code I requested.
Do you think you could share some of the prompts you are testing with me? I’d like to run them through our visual text network analysis tool to uncover the main themes and high-level ideas in good vs. bad prompts… Thanks!
I feel that OpenAI is treating its customers like criminals. Every time I submit a message now, I get a stupid captcha to verify every. single. message. Then I get a response from an inferior, turbo-grade model that is incomprehensible or totally wrong, and I have to reword it, but don’t worry, here’s another captcha! Have issues? Gather a dossier of evidence for the OpenAI devs! We need to create a Pepe Silvia-level diagram to satisfy the white knights on this thread who are turning a blind eye to the quality issues clearly evident in GPT-4 over the past month. I’m trying to be patient about it. I know OpenAI is experiencing explosive growth, and I give grace. But the management of communication with people about the product is just abysmal. Don’t charge people $20 a month and then have a customer service experience akin to smoke signals.
If you turn off your browser plugins (privacy badger, adblock, etc), do you still get the captcha?
Just a suggestion for those of you who believe the model quality has decreased (or could decrease) with subsequent updates.
Benchmark it.
- Curate 10–20 standard prompts of a type that interests you.
- For each of the prompts:
  a. Identify what you consider to be the platonic ideal response.
  b. Design a 5-point rubric for grading each response in comparison to the ideal.
  c. Generate 10–100 responses.
  d. Grade each response and compute summary statistics.
- Compute a final score point estimate and distribution for the model.
If you repeat this process with each new version of the model, you’ll have quantifiable, testable claims when you say the model has degraded.
At ~1,000 tokens per test, a full run (20 prompts × 100 responses = 2M tokens) would cost about $4 per version for GPT-3.5 through the API, though GPT-4 would be about $120.
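For anyone who wants to try this, here’s a minimal sketch of such a harness in Python, assuming the (2023-era) openai client library; the model name, prompts, and manual grading step are all placeholders to swap for your own:

```python
# Minimal benchmark harness sketch; assumes `pip install openai` (the
# pre-1.0 client) and OPENAI_API_KEY set in the environment.
import statistics
import openai

MODEL = "gpt-3.5-turbo"   # or "gpt-4", at roughly 30x the cost
RUNS_PER_PROMPT = 10      # 10-100 runs per prompt, per the steps above

# Placeholder prompts; curate 10-20 standard prompts of your own.
PROMPTS = [
    "Write a Rust function that reverses the words in a sentence.",
    "Summarize the plot of Hamlet in three sentences.",
]

def generate(prompt: str) -> str:
    """Request one fresh response for a prompt from the target model."""
    resp = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def grade(prompt: str, response: str) -> int:
    """Score a response 1-5 against your rubric and ideal answer.
    Manual input here; it could be automated with another model call."""
    print(f"\nPROMPT: {prompt}\n---\n{response}\n---")
    return int(input("Score (1-5): "))

scores = [
    grade(p, generate(p)) for p in PROMPTS for _ in range(RUNS_PER_PROMPT)
]

# Point estimate and spread for this model version.
print(f"n={len(scores)}  mean={statistics.mean(scores):.2f}  "
      f"stdev={statistics.stdev(scores):.2f}")
```

Dumping the raw responses to disk alongside the scores would also let anyone re-audit the grading later, which matters for the trust problem discussed below.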
That said, if there’s a genuine concern there should be enough people willing to crowd-source an effort where everyone volunteers to let someone run a couple of benchmarks using their API key, or possibly host a page where people can enter their API key and select the number of runs they want to donate.
You’d want one trustworthy person/organization to handle all the requests, otherwise you’d need to trust that people with biases and agendas (either side) aren’t cherry-picking the results that support their position.
If it gets democratized enough you could eventually crowd-source the submission of new prompts.
TL;DR: see the last paragraph at the bottom.
I understand where you are coming from and what you are saying. This is a developer forum, and everybody who visits here regularly likely has the skills to perform these tests with ease. Considering the good faith some members of the community bring toward this truly amazing tech, we should be able to provide hard facts with no problems whatsoever.
On the flip side, many of us have access to GPT-4 via a paid subscription only, you know, the Plus thing. People subscribe and remain paying users because the experience is great. Period. Suddenly these users, from various corners of the net, start reporting issues. Having a marketing background, I know exactly what that means. Consider that the people reporting issues have been using the product for months and have kept paying because it’s worth it. The same users suddenly report that it’s not worth it anymore, and the reports are accumulating. I call this a hard fact. Still, people with a different background may differ on this. I understand and accept that.
But let’s focus on the data collection process. Of course I can dig into my past conversations. Different conversations, for different projects, start with the exact same prompt, and I used to get a comparable answer every single time. This has changed. I can now dig deeper into my data export, collect all the answers, and pinpoint the exact date this behavior first appeared and how it has kept repeating for about a month now. Then I can pack it up nicely and send it to a random OpenAI team member without expecting any answer whatsoever. Which is fine. My dataset is small, unstructured, and in no way representative. Furthermore, I understand that the community managers perform this role at a low priority due to immense workloads.
[Edit: removed because it’s unnecessary]
So, where does that leave us now? And what does it mean for the discussion about these observations?
What I am saying is that as a paying beta tester there is only so much additional complication I am willing to accept in order to drive somebody else’s project to commercial success. I hope this makes sense and that nobody feels offended as this is more aimed at clarifying a different perspective on this topic.
@vb, thanks A LOT for this clarification
Interesting proposition, but difficult to implement because:
- it is impossible to access previous versions of the model;
- the available histories are likely sparse (history management is a disaster, which encourages rapid deletion) and too unstructured to anticipate tool degradation;
- OpenAI has powerful evaluation tools (the ones behind all those marketing presentations) and should therefore be able to demonstrate what is needed.
I’m quite bothered to read in your message:
“those of you who believe”
“you’d need to trust that people with biases and agendas (either side)”
This seems like an assumption or insinuation of intellectual dishonesty on the part of the (numerous and competent) individuals who have observed significant changes… For my part, I have no deliberate bias and no agenda.
I have no clue. I was able to resolve the issue with troubleshooting steps, but it returned later and then disappeared again. But you’re missing the point. I’m not incapable of diagnosing the root cause of my issues, and I’m not incapable of resolving them through troubleshooting. I could provide examples and troubleshooting steps, but I’d argue that the cap of 25 messages per 3 hours already imposes major obstacles, and I honestly don’t have the energy to do a full QA test every 5 minutes. Nobody is asking for perfection. Literally just communication. And don’t come at me with the whole “oh, they posted X on their Discord”, like, are we in The Hitchhiker’s Guide or something? Many people like myself are smart enough to see the difference between some guy going “GUYZ THEY ARE NERFING GPT!!!1111!” because it won’t let him be racist, and folks whose understanding of the process lets them mostly solve issues with the model themselves, yet who are noticing a MASSIVE increase in the corrections they have to make.
This entire time, OpenAI has been positioning its paid subscription as a premium experience, when in reality, more often than not, the premium experience of ChatGPT Plus is having white knights tell you how you’re just using it wrong. So, I get that there is a process, but don’t advertise Plus the way they do and then never say anything to Plus users when they have legitimate concerns. If I don’t get access to Code Interpreter, if I don’t get an increased message cap, if I don’t get easier plugin management, if I don’t get access to a larger context window, if I don’t get access to uncensored output, if I don’t get access to FULL GPT-4, if I don’t even get a message refund when OpenAI times out, then at the absolute minimum $20 a month should ensure users don’t have to deal with issues like a captcha coming up every time they press Enter.
What am I paying $20 a month for at this point? I come here to see people complaining about similar issues, only to be told to go do the troubleshooting themselves. Is that what they mean by OpenAI? The QA is open source?
I cannot agree with you; the GPT-3.5 API has also been “degraded” somehow. And it is considerably cheaper and very easy to test.