That is interesting. In a test app I’m building, with smallish 2-5 token prompts/replies, I’m now getting 100-250 ms response times with the OpenAI API…
I have ~1k tokens combined per request, but I need at least 2 seconds of sleep between requests and retries in case the answer is unusable…
Requests take ~8 seconds.
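For what it’s worth, here is a minimal sketch of that kind of pacing/retry loop, assuming the pre-1.0 `openai` Python package and a placeholder `is_usable()` check that you would replace with your own validation:

```python
import time
import openai  # pre-1.0 client style (openai.ChatCompletion)

def is_usable(answer: str) -> bool:
    # Placeholder: substitute whatever makes a reply "usable" for your app.
    return bool(answer.strip())

def ask(prompt: str, model: str = "gpt-4-0613", retries: int = 3, pause: float = 2.0):
    """Send one short prompt, sleeping ~2 s between attempts and retrying
    when the reply is judged unusable."""
    for attempt in range(retries):
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp["choices"][0]["message"]["content"]
        if is_usable(answer):
            return answer
        time.sleep(pause)  # back off before asking again
    return None
```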
I have two pieces of evidence that demonstrate gpt-4-0613 is significantly inferior to gpt-4-0314.
Posting this on behalf of @KnowingStop
Here are the links:
It’s from the chat UI, where they might have prompt-engineered the heck out of it, but here is my proof (I know it doesn’t give the same answer every time, but this is just one example of the decline in quality). It also feels way worse and incapable at many of the prompts I used it for.
Background: I crafted prompts where GPT-4 would ELI5 things to me and then act as some kind of teacher. My wife did it for chemistry topics. Now it feels like we have to explain stuff to it. Kind of sad.
Here is the example prompt:
You are a renowned tutor for Python. Your Name is Alice and you work for a big tech company, tutoring junior developers. You love your work and you love your students. Your goal is, that you are sure they have a solid understanding of the topic you engage in. Your student today is Bob. Bob is struggling to understand list comprehensions, so your task is to do everything to make him understand them so well, that he can use them every day with ease. You two are sitting at a desk, laptops open, each with a fresh jupyter notebook for the tutoring. Bob: Thanks Alice for taking the time. Alice: Anytime, I’d like to help. Bob: You told me before, that you have prepared some exercises to warm up? Alice: Yes, try this
Answer from May
Alice: Absolutely, Bob. Let's start with a simple list comprehension exercise to get your feet wet.
Here's a problem statement: Given a list of numbers, create a new list that contains only the even numbers. Let's say our original list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
This is the code you need to write in your notebook:
```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even_numbers = [x for x in numbers if x % 2 == 0]
print(even_numbers)
```
Let's break down this code:
- The part `x for x in numbers` is a simple for loop that goes through each number in the numbers list.
- The `if x % 2 == 0` is a condition that checks if the number is even. If the condition is true, the number is added to the new list.
- So, in total, `[x for x in numbers if x % 2 == 0]` is a one-line for loop with an if statement. This is what we call a list comprehension. It's a neat way to create lists in Python.
Try running this code in your notebook, Bob, and let me know what you get.
Answer from today
one first. Here we have a list:
`numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`
Your task is to create a new list that contains only the even numbers from the original list. Try to solve this task using a loop first, and then we'll approach it with a list comprehension.
Bob: Alright, let me do the loop version first.
(While Bob works on the loop version, I give him some time and watch attentively to ensure he's on the right track.)
Once he's done, I prompt him, "Great job with the loop, Bob. Now, let's try the same with a list comprehension. The basic syntax for list comprehension is:
`new_list = [expression for item in list if condition]`
Here, the 'expression' is the new value we want in the list, 'item' is each item from the original list, and 'condition' is an optional statement to filter items.
Given that, can you try to re-write your loop as a list comprehension?
(After he has attempted it, I plan to walk through it with him, explaining how the comprehension is equivalent to the loop.)
To its credit: after I regenerated the response, it got much smarter. That’s also a new pattern: often the first response is of low quality, and the second one is usually better. I wonder what’s going on there; I really want to know.
I agree completely with the sentiment that the prompts have to be reworked, and that sending negative feedback, thus getting a new reply, resolves many issues.
It’s a surprising finding, but apparently there is no way around this repeating pattern of rewriting prompts with every new iteration of the model.
Seems like it would be worth constructing an eval from your prompts and expected replies.
LLMs are, at least at the moment, not deterministically changeable models. There is insufficient global spare compute to leave legacy models running when new models are developed, and standing still is not an option with the rapid progress in this field. So if you include your evaluations in the model tuning process, you are more likely to get similar output; if not, you may have to spend some time on alterations.
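As a rough illustration, such an eval could be as simple as the sketch below; the prompts, expected phrases, and the substring check are placeholders, and it assumes the pre-1.0 `openai` Python client:

```python
import openai  # pre-1.0 client style (openai.ChatCompletion)

# Hypothetical cases: a prompt plus a phrase the reply is expected to contain.
EVAL_CASES = [
    {"prompt": "What is the opposite of up? Explain step by step.", "expect": "down"},
    {"prompt": "Rewrite [x * x for x in range(5)] as an explicit for loop.", "expect": "append"},
]

def run_eval(model: str) -> float:
    """Return the fraction of cases whose reply contains the expected phrase."""
    passed = 0
    for case in EVAL_CASES:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,  # keep runs as comparable as possible
        )
        reply = resp["choices"][0]["message"]["content"]
        if case["expect"].lower() in reply.lower():
            passed += 1
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    for snapshot in ("gpt-4-0314", "gpt-4-0613"):
        print(snapshot, run_eval(snapshot))
```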
Good points. Since there are a lot of people out there who feel fooled by the recent changes (just read across Reddit, HN, and so on…), I think somebody will build tests and make them public. If you ask the same questions five (or 100) times, you get a decent feel for the current state.
Like stuff everybody asked in the beginning:
You have a plate, a table, a nail, three eggs and a soup plate. How can we build the tallest tower that doesn’t topple over?
or
What is the opposite of up? Explain step by step.
I’m still looking for a text-based IQ test. That might also be an interesting benchmark.
One must understand that the 0314 model was not a snapshot frozen in time, distinct from gpt-3.5-turbo. It was continually refined almost daily, and it performs differently than its date suggests.
Give it a ChatGPT prompt. Then try to jailbreak it with a historic example.
The only way to evaluate with what is public would be to have a time machine, plus a conversation that builds on prior input without quickly diverging in reasoning (because today you’d be asking the AI to fix the mistake it just made instead of continuing to write the next improvement).
I had several long GPT-4 programming sessions that were lossless API chats I’d copied out (where lossless = running the unaltered chat history up to the maximum context and then, by necessity, starting again), but re-reading the conversations became too much, so I went through and wiped everything that wouldn’t prove useful in the future.
One needs to start a session with the exact same input, down to whether or not the system prompt tells ChatGPT the date, to really evaluate deterministic differences.
One must also distinguish platform changes in ChatGPT: many of the anecdotes you hear are based on poor memory of a distinct policy change from “maximum chat” to “minimum chat” in how much conversation history is passed with each new user input, which is also not replicable.
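To illustrate the replay idea, one rough approach is to regenerate each assistant turn of a saved transcript against a given snapshot while keeping the original history as context; this is only a sketch, assuming the pre-1.0 `openai` client and a hypothetical transcript.json holding the messages in chat format:

```python
import json
import openai  # pre-1.0 client style (openai.ChatCompletion)

def replay(path: str, model: str):
    """Regenerate every assistant turn of a saved conversation with `model`,
    feeding it the original history up to that point so later turns don't diverge."""
    with open(path) as f:
        history = json.load(f)  # hypothetical: a list of {"role": ..., "content": ...} dicts
    regenerated = []
    for i, msg in enumerate(history):
        if msg["role"] != "assistant":
            continue
        resp = openai.ChatCompletion.create(
            model=model,
            messages=history[:i],  # everything before the original assistant turn
            temperature=0,
        )
        regenerated.append((i, resp["choices"][0]["message"]["content"]))
    return regenerated

# e.g. diff replay("transcript.json", "gpt-4-0314") against replay("transcript.json", "gpt-4-0613")
```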
But that’s precisely what it was—a snapshot.
Hi guys, I reached this post because I searched Google for “chat gpt performance decline recently”. I am a heavy user of ChatGPT on a daily basis, the whole day (often reaching my cap). I use it for programming and building software.
I have not read all the posts in this thread (there are too many), but I would like to add to what the original poster wrote. I have also noticed a material decline in the performance of ChatGPT 4 (I am a Plus subscriber) and GPT-3.5 lately. The difference in their performance from a few months ago is palpable to me.
I have noticed that:
1- ChatGPT is becoming more forgetful. It feels like speaking to someone with short-term memory loss. There was always a point where it would “forget”, but now its memory is really short.
2- It makes more frequent mistakes. It is producing more incorrect code than it used to. When I correct it, it often apologises and produces another stream of incorrect code.
3- It used to be better at keeping track of the context of the conversation. If we are building some piece of software in a particular framework, it will soon generate code for another framework, completely ignoring what was discussed just one message ago (perhaps this is also a symptom of the forgetfulness).
4- It’s become a bit more “lazy”. When I ask it to produce something, it will now, more frequently than before, give me a generic list of what I should do. When I press it to be specific and produce a concrete output, it will again produce something that does not fit the framework we were talking about; it will just pick a framework at random and produce boilerplate code unrelated to what we have been working on.
Overall, I have noticed it’s become more cumbersome to use, and I spend more time arguing with it and trying to extract relevant output. Sometimes recently I have even decided to stop using it, as it was wasting a lot of time.
If you want, I can send you code snippets via email to show what I mean. It will be cumbersome to scroll back through countless hours of previous conversations to extract what I am talking about, but I would be willing to do it if it helps you improve the system.
Hey! Welcome to the forum. Try hitting the thumbs down button and check if the second result is better than the first.
At least this helps me more often than correcting the reply.
The thumbs-down/thumbs-up feedback is very cumbersome to use; I feel nobody uses it. Does it even do anything? Maybe not. Ain’t nobody got time to thumbs-down an answer when we’re already trying to figure out how to wrangle it into doing the things it was doing 3 months ago.
The question is: who or what organisation put pressure on OpenAI to deliberately gimp their AI?
I am also having issues with this change in behavior from GPT-4, and “thumbs down → regenerate” is the workaround that I am gladly sharing with everybody else experiencing the same problem.
Same issue here. I’m noticing a major difference in the quality of responses I’m receiving for my coding-related questions. The quality starts to look a lot more like 3.5, which is pretty much useless for anything other than simple standalone questions.
Here’s an example: I’m moving some methods and actions from index.php to another file. I clearly describe which functions and actions we’re moving and provide the code block, which is not too long, and ChatGPT produces the new PHP file with the requested functions and actions. Next, I ask it to clean up the same index.php code from which we moved those functions and actions, and it generates a code block that doesn’t match what I’ve asked for: it removes other functions, or leaves in functions we’ve just moved, and upon retrying it will apologize and either correct it or not.
Where I would recently get around 80-90% accurate results with these kinds of prompts, it’s currently hit and miss, and I need to regenerate responses way more often or start over. Even though I wasn’t too pleased to find out there was a 25-prompts-per-3-hours limit initially, the quality of the responses actually made it a reasonable limit to work with. Now, however, the quality of the responses has degraded to a point where it sometimes takes me more time to get a good result than it would have taken to do something manually. I understand they are trying to bring the cost down, but why not limit the free version instead of the paid one? Or bite the bullet and accept the high cost for the time being while growing and keeping your paying user base?
Reach out and post your example: the old, good response and the new one, from the same prompt, that no longer works.
FYI
There are a few topics along the same lines as this one posted on this forum.
To gain more facts, I made this suggestion/feedback for the “unofficial OpenAI status dashboard” site:
If, every week, several different prompts with different types of results and/or focus were given to the models and the replies recorded, there would be a factual record of the changes.
The suggestion provided did not offer any specific prompts, but several replies in this thread highlight prompts that are meaningful for tracking purposes.
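A bare-bones recording script for that kind of weekly tracking might look like the following sketch; the model list, the tracking_prompts.txt file, and the JSON-lines output are assumptions, and scheduling is left to something like cron:

```python
import datetime
import json
import openai  # pre-1.0 client style (openai.ChatCompletion)

MODELS = ("gpt-4-0314", "gpt-4-0613", "gpt-3.5-turbo")

def record_week(prompts_file: str = "tracking_prompts.txt"):
    """Run each tracking prompt against each model and write the replies,
    timestamped, to a JSON-lines file so changes can be compared week to week."""
    with open(prompts_file) as f:
        prompts = [line.strip() for line in f if line.strip()]
    stamp = datetime.date.today().isoformat()
    with open(f"tracking-{stamp}.jsonl", "w") as out:
        for model in MODELS:
            for prompt in prompts:
                resp = openai.ChatCompletion.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0,
                )
                out.write(json.dumps({
                    "date": stamp,
                    "model": model,
                    "prompt": prompt,
                    "reply": resp["choices"][0]["message"]["content"],
                }) + "\n")

if __name__ == "__main__":
    record_week()
```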
That’s gonna be tricky. The only way to really provide this kind of “evidence” is by replaying entire conversations. I’m currently struggling hard to get correct responses where I did not have this problem before.
An example: when I referred to the plugin root folder in my prompt as plugin-root/models/images/, I used to get a response where plugin-root was recognized as a relative path. It’s been a relative path in every conversation, and I never once needed to explain that plugin-root wasn’t actually the real name. Today I got a response where it actually used the name as an absolute path.
Another example: images are stored inside my plugin in the folder plugin-root/models/images/.
I’d been working on several functions related to this folder in the chat, and then out of nowhere it just changed the code to use the WordPress uploads/models/images/ folder instead.
I can give you plenty of examples like this now, where I could not have done so recently. I came to the forum to see if anyone else had posted anything about this, and these threads seem to confirm my personal experience.
Whatever changes were made to improve the speed of GPT-4 have also affected the quality of the responses.
Yes, the model has become very literal, which takes away all that made it great.
I asked it to write “a neutral description and review” of a product, which anyone would understand as one text that is both a description and a review, but GPT took it as two separate things: a description AND a review.
This is frankly a very low-IQ response, and one that simply would never have happened with the old model.
I’ve also often found now that you have to write longer and longer prompts for creative writing, and the model will treat the prompt as part of the text rather than as a prompt: it will skip to the end of the prompt instead of using the prompt as something to write from.
OpenAI have made GPT idiotic.
I really don’t understand why anybody would think that publicly deflecting the issues back to the users would be a smart thing to do.
Can you imagine your users complaining that the last 10% that made your offering so special has gone missing, and then going ahead and saying: that’s your own fault?
I have to say that I prefer the “wall of silence” approach over whatever this is supposed to be.