It sounds to me like you should stop paying $20/month.
I don’t see the context of a discussion about GPT-4. What do you mean?
First, obviously this would only be applicable going forward.
Second, there was no insinuation made or intended.
A not-insignificant number of people do believe the model’s performance has degraded, but there are countless others who do not, have no opinion on the matter, or haven’t expressed a view one way or the other.
But there isn’t any clear, hard, empirical evidence to support those claims. There are a lot of anecdotal reports, but they’re impossible to quantify at this point.
That’s why I said what I did. The purpose wasn’t to impugn anyone, but rather to advise caution against trusting and relying on self-reported data, because of:
- Biases. Biases aren’t necessarily nefarious; we all have them. The big ones I feel might be at play here are:
a. Novelty bias, where the perception of ChatGPT’s capabilities was inflated and its limitations were overlooked when it was new and novel. As we’ve all gotten accustomed to it and continue to try to push it beyond its limits, its flaws become much more apparent.
b. Rosy retrospection, where people just remember the past as better than it was.
c. Confirmation bias, where once someone comes to believe the model has degraded, they tend to fixate more on areas where the model falters than where it succeeds.
- Agendas. Agendas are bad, and while it’s true that most people wouldn’t act on them, some may, and it’s better to eliminate that possibility as much as possible. It’s not a stretch to imagine someone with a biased belief that the model’s quality has degraded going out of their way to find examples demonstrating that, submitting only bad responses because they consider good responses anomalous. Likewise, someone who believes the model has improved may be motivated to regenerate responses several times to get a good one, chalking the failures up to randomness and not indicative of the model’s actual capabilities.
So you can be “embarrassed” to read what I’ve written, or you can stop to think about it and realize I’m just proposing a better, more robust way to collect data. If that were done, people with beliefs about the trajectory of the model’s performance could be more confident in the results, regardless of what those results actually are and even when they don’t agree with their anecdotal experience, because there’s less risk anyone put their thumb on the scale, so to speak.
¯\_(ツ)_/¯
I don’t know if this is directed at me or not, but I’ll take a stab at it.
I don’t think people are demanding proof in general and I certainly don’t feel I am.
I’m more skeptical than anything, as I’ve not noticed any degradation myself, and I’m curious to see examples of this because, if it’s true, it’s interesting and important to know the specifics of where this rolling back of capabilities is happening.
I’m not challenging people’s claims because I think they’re wrong and I’m right, but rather because I just want to know the truth of the matter so I’m not wrong going forward.
I’m trying to propose a collaborative system by which we, the community of users, can independently audit the effects of model changes in a quantitative way that is scientifically robust. That way, people who feel the model has degraded can have concrete, empirical evidence of it, and we can elevate the discussions around model versions from nebulous claims and feelings to numbers and facts, hopefully moving beyond this tribalism where people entrench themselves on one side or the other, dig in, and refuse to budge because neither side can offer anything compelling to the other.
I like the idea. But, you know, what about someone from OpenAI coming forward and stating some analytics data like: “the relative amount of negative feedback from all GPT-4 users is unchanged”?
This could be so simple, or am I missing something?
GPT4 is broken. At least on the web app.
I believe it isn’t receiving/processing my full 4,786-token message.
It continues from some random point in the message instead of following my instructions at the end.
It seems to be cutting off after 4,565 tokens? At least, that’s where the model said my article cut off.
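For anyone who wants to double-check the count on their end, here’s a quick sketch using OpenAI’s tiktoken library (assuming the message is saved in a local file, message.txt; the file name is just a placeholder):

```python
# Count tokens the same way the GPT-4 tokenizer does (requires: pip install tiktoken).
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

with open("message.txt", encoding="utf-8") as f:
    text = f.read()

tokens = encoding.encode(text)
print(f"{len(tokens)} tokens")  # compare against the point where the model stops reading
```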
I tweeted at Greg Brockman hoping he would see it but no luck yet. If anyone wants to retweet it that might help. Twitter is @nfcodes
Who else is experiencing this?
To be honest, I have not noticed anything. I have always said I can’t really tell the difference between GPT-3.5 and GPT-4, though others, responding to my comment, have said GPT-4 was “markedly better.” I think it is possible people are just too used to chatting with it like a human, and now they are starting to notice subtleties.
As a former game developer, I have seen this quite a bit. Users will initially be enjoying a feature, but then after an update (even though nothing touched that feature) people start questioning and complaining about some aspect of it. Obviously, people should speak up if there are quality issues. Sam Altman led YCombinator efforts, so he is no stranger to tough customers, but people should be aware of that phenomenon. Maybe just be more careful how you word things for GPT-4 and be sure to add appropriate context.
I’m a Plus user too. Honestly, ChatGPT 3.5 was insanely smart when I first experienced it; now it is almost shocking how irrelevant the responses are. Moving on to GPT-4, I was instantly mesmerised by how much smarter it was… until a few weeks ago. It honestly feels like I am now using ChatGPT 3.5, no difference whatsoever. What used to provide me with accurate, specific information and follow tasks with AMAZING memory now suddenly just doesn’t. I’m using GPT daily, and the dramatic difference is really disappointing. Maybe they are doing a study or something and randomly cursing people to see their responses; I hope so, anyway.
I’m actually looking at making a Chrome extension right now that would randomly pull standardized prompts from a simple web server, submit them through ChatGPT, then grab the responses and send both back to the web server to be collated and analyzed.
That way, anyone using ChatGPT can volunteer to run and submit as many prompt:response pairs as they want. If a bunch of people generate a bunch of data between now and the next update release, we’ll have a baseline against which to compare when we collectively do it again with the next version.
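The server side could be almost trivially simple. A minimal sketch, assuming Flask and SQLite; the endpoint names, prompt list, and schema are all just placeholders:

```python
# A bare-bones collection server for the idea above (requires: pip install flask).
import random
import sqlite3
import time

from flask import Flask, jsonify, request

app = Flask(__name__)

# Standardized benchmark prompts everyone would run, so results are comparable.
PROMPTS = [
    "Summarize the plot of Hamlet in exactly three sentences.",
    "Write a Python function that returns the nth Fibonacci number.",
]

@app.route("/prompt")
def get_prompt():
    # Hand the extension a random standardized prompt to submit through ChatGPT.
    return jsonify({"prompt": random.choice(PROMPTS)})

@app.route("/submit", methods=["POST"])
def submit():
    # Store the prompt:response pair with a timestamp so runs can later be
    # grouped by model version and compared across updates.
    data = request.get_json()
    with sqlite3.connect("responses.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS pairs (ts REAL, prompt TEXT, response TEXT)")
        db.execute(
            "INSERT INTO pairs VALUES (?, ?, ?)",
            (time.time(), data["prompt"], data["response"]),
        )
    return jsonify({"ok": True})

if __name__ == "__main__":
    app.run(port=8000)
```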
It’s not something I really want to take point on, but maybe if I slap together a proof-of-concept version someone else will take it on.
But, I definitely won’t get to it until the weekend.
I’ve been getting this on the Browser for the past several hours:
It doesn’t fail gracefully and say, “Sorry, can’t find anything.” It just fails, and the only way out is to click “Regenerate Response,” which leads back into the same failure cycle.
These machines aren’t human, or even sentient, but they are trained to act like humans. The same way your mother could tell you were upset despite your best effort, your teacher could tell you weren’t applying yourself to your abilities, or you can tell your child is lying. Or your dog can tell when you are about to leave for the day. There is something about observing behavior over time that makes us, as humans, very capable of determining when habitual behavior has changed.
I see no point in all of us boring each other to tears with posts consisting of prompts and responses that only each poster can understand. The fact that I notice some unusual behavior, and that other people who use this system as much as or more than I do report the same thing, is enough for me to conclude something is amiss.
If you aren’t having any issues, that’s great. Despite my issues, I am amazed and feel extremely lucky to have access to GPT-4 (and GPT-3.5-turbo-14K). They both still save me tons of time, help me learn (at my advanced age) more and faster, and overall make my life as a developer a little easier than it was 6 months ago.
But, if the whole point of me having access to an Alpha/Beta project is to demonstrate to OpenAI how it’s performing, I would think that pointing out the problems I run into would be part of that process.
Ditto! I use the models primarily to assist with PHP coding. I understand that they are best at Python, but PHP is at least in the top 5 current programming languages and must have been included substantially in their training data.
Like you, I used to be able to get fairly reliable responses from gpt-3.5, and the best from gpt-4. Now gpt-4 regularly hiccups and stutters and just plain old “hallucinates,” and has become far less reliable. Now I have fallen back on Codex, which appears relatively stable. It also seems to learn a lot more quickly from its mistakes and is faster to circle around to correct answers, while chat gpt-4 more and more goes down rabbit holes.
We’re not crazy. Something is different, for sure.
For PHP development you still get some good results, but it is getting worse because of the missing newer data. I am already thinking about embedding the documentation of newer versions of, e.g., Symfony and building a plugin so ChatGPT can query it before it answers…
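Roughly, the retrieval half of the idea would look something like this. A sketch, assuming the openai Python package and its Embedding API, with an OPENAI_API_KEY set; the documentation snippets and function names are just placeholders:

```python
# A rough sketch of the retrieval idea (requires: pip install openai numpy).
import numpy as np
import openai

# Placeholder snippets; a real plugin would index the actual Symfony docs.
docs = [
    "Symfony 6.3: #[MapQueryParameter] binds query parameters to controller arguments.",
    "Symfony 6.3: AssetMapper serves JavaScript and CSS without a bundler.",
]

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

doc_vectors = [embed(d) for d in docs]

def best_match(question: str) -> str:
    # Cosine similarity picks the documentation snippet to prepend to the prompt,
    # so the model answers from current docs instead of stale training data.
    q = embed(question)
    sims = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    return docs[int(np.argmax(sims))]

print(best_match("How do I bind a query parameter in a Symfony controller?"))
```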
On the other hand, when people complain about “hallucinations,” I think that’s something that, at least in programming, is needed. Sometimes there is no library for something, and you can barely find information about it on the web either, but ChatGPT still hallucinates something for you that in most cases doesn’t need much work afterwards.
I even like that it hallucinates comments to your code haha…
Therefore, I unsubscribed. I subscribed to Plus very early on because ChatGPT was very good at writing code. However, now it is clear that, whether it is 3.5 or 4, it basically cannot complete the task of writing code.
I can assure you they can. But you have to learn how. And there are some limitations, but on a level that is way over normal application development.
Agreed. My entire roguelike project was written with GPT-4’s help… You have to know how and what to ask and give it enough info… and know enough not to be led down bad paths… It increases my productivity as much as, if not more than, using it for content writing (fiction)…
Writing stories and creating code are more or less the same activity.
That’s why solving novel creation will also solve the auto-creation of whole applications.
*without the need for human interaction
I just posted this: OpenAI Doesn't Need StackOverflow or Reddit
It got me to thinking about the subject of this thread.
One thing we can all do is use the thumbs-up/down icons to let OpenAI know when the model’s responses are good and when they are terrible. I’ll admit I’ve not done this enough, but I’m going to start being a lot more diligent about it. Hopefully, over time, if enough of us start doing this, we will see some improvements.
I subscribed to ChatGPT back in March because GPT-4 was so good at writing code. Now I have exactly the same complaints as the rest of you: it doesn’t feel like the GPT-4 from March, it feels like GPT-3.6.
Even worse, the model can’t do simple things now. If I ask it to provide a list of 10 things, it provides a list of 3 and starts rambling about the character limit! I don’t mind clicking the continue button; I do mind a bot that refuses to do what I ask it to do.
Yesterday I created a simple prompt to extract data from a PDF and generate the result in JSON format; gpt-3.5-turbo returned the correct data. Today I tried gpt-3.5-turbo-16k, and it returned saying it was unable to extract the data. Now, a few hours later, gpt-3.5-turbo is giving the same result as the 16k model: it is not able to extract the data.
Everything is the same, I just changed the model name.
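For context, the call is essentially this. A minimal sketch of the kind of request involved, assuming the openai Python package (with an OPENAI_API_KEY set) and pypdf for pulling text out of a text-based PDF; the file name and fields are hypothetical:

```python
# Reproduces the extraction setup described above (requires: pip install openai pypdf).
import openai
from pypdf import PdfReader

# Pull the raw text out of the PDF; extract_text() assumes a text-based (not scanned) PDF.
text = "\n".join(page.extract_text() for page in PdfReader("invoice.pdf").pages)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # swapping in "gpt-3.5-turbo-16k" should only change the context window
    temperature=0,  # keep the output as deterministic as possible so runs are comparable
    messages=[
        {"role": "system", "content": "Extract the invoice number, date, and total. Reply with JSON only."},
        {"role": "user", "content": text},
    ],
)
print(resp["choices"][0]["message"]["content"])
```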
I honestly think they just messed with the model parameters on ChatGPT, something like including a note to be concise in the system prompt and reducing the de facto token limit in the chat and the max completion length. That would seem to fall in line with what most people are complaining about.
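That theory is at least testable through the API, where you control the system prompt yourself. A quick sketch, assuming the openai package with an OPENAI_API_KEY set; the prompts are arbitrary:

```python
# Test whether a "be concise" system note alone explains shorter answers
# (requires: pip install openai).
import openai

def ask(system, prompt):
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": prompt})
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
    return resp["choices"][0]["message"]["content"]

prompt = "List 10 creative uses for a paperclip, with a sentence explaining each."
plain = ask(None, prompt)
concise = ask("Answer as concisely as possible.", prompt)

# If the concise run looks like what ChatGPT gives today and the plain run looks
# like March, that's weak evidence for a system-prompt change rather than a new model.
print(len(plain.split()), "words without the note vs", len(concise.split()), "with it")
```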
But I really would like to see a side-by-side comparison, at least one where the results seem worse. I’ve been doing my best to compare old prompts to the current responses on ChatGPT, and I just have not noticed anything.