Check out that thread. I did a little testing of my own, and honestly I'm not sure I fully understood your code, but I did my best. Overall, I got one poor response out of three tries, though I can't tell whether the end result was satisfactory. I need to get back to work, but I'm interested to see how you'd grade the responses I got.
I was on an external production the whole of last week and didn’t use it for some 9 days until last night. I was appalled by the drop in quality, it felt like I was using GPT 2.5 or earlier. I’m a plus user and rely on it daily for work. The experience has been smooth and consistent since March, but I struggled with it for almost 24 hours yesterday. Eventually, I dropped it and did everything the “old” way (myself). There are many questions here, but mostly I feel ripped off by OpenAI as a paid subscriber.
100% same issue. Reported in detail here.
It is clearly a waste of time using GPT-4 now. My son, who uses it for help with his 12th-grade classes, has been screaming the F… word at it for the past two weeks (before, he was addicted to how wonderful GPT-4 was…).
GPT-4 team, wake up.
Today he tested Bard, and it solved the math problem in one go…
100%, I have felt it. It's like they are reducing compute. I used to max out GPT-4, take a break, and continue when my 25-message cap reset… it was worth it… the output for me was incredible. Now I find myself really questioning whether to use what the model responds with. ( :sad, we used to be partners!) It's like a bad intern who confidently tells me the wrong thing, and I have to catch it all the time. Here's hoping they increase compute power or go back to the old GPT-4.
6/27/23 … & ChatGPT4 still appears to be a shell of its former self (i.e., lobotomized).
I’ve seen a lot on this topic, and I’ve been participating in it for the past month or so. GPT-4 was truly amazing at writing whole complex snippets of code logic before; various prompting techniques plus a high-level description of the task were enough to get an almost perfect match for what I needed.
Once they shipped the iOS app and the plugins (both within a short timespan last month), it started forgetting context from messages placed literally 1-2 messages before the current one. I believe the context length itself was significantly reduced: previously I could paste long (300-400 line) snippets into it, and it would remember them very well for a fairly lengthy conversation with almost no reminders from me. Currently it doesn't do that. You paste a snippet containing the name of some function, and three messages later, if you ask it about that function, the chances that it starts hallucinating something completely different have gone through the roof.
Don’t get me wrong, I am still able to get it to do what I want; the issue is that it has become significantly harder. I’m actually a strong proponent of adding tiered subscriptions, where the bigger your tier, the more computational power is allocated to your personal experience; that way I'd know that, at the very least, what I am paying for is worth it.
For the moment, using the Playground works for me; being able to actually tweak the configs makes a significant difference.
I agree with the subject of this post and want to add my own concerns.
I have run a series of tests on the GPT 0613 model's capabilities, and we are concerned about the loss of logical capability compared with the previous 0314 version.
One example: when asked whether a value of 4500 lies inside the range between 4000 and 6000, the old model understood this, but the new one doesn't. This causes big trouble when interpreting data inside HTML tables.
I strongly recommend that OpenAI extend the deprecation date of the 0314 model well beyond the current 09/13/23, and clearly state whether the current 0613 model should be considered the new state of the art going forward (or whether they plan to fix it).
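For what it's worth, this kind of check is easy to script so regressions between model snapshots show up automatically. Below is a minimal sketch: the model call is deliberately left out (wire the grader to your own API client), and all function names are illustrative, not taken from anyone's actual code.

```python
def build_prompt(value: int, low: int, high: int) -> str:
    """Build the yes/no range question sent to the model."""
    return (
        f"Answer with a single word, yes or no: "
        f"is the value {value} inside the range between {low} and {high}?"
    )

def expected(value: int, low: int, high: int) -> bool:
    """Ground truth computed locally, so the model's answer can be graded."""
    return low <= value <= high

def grade(model_reply: str, value: int, low: int, high: int) -> bool:
    """True if the model's yes/no answer matches the arithmetic truth."""
    said_yes = model_reply.strip().lower().startswith("yes")
    return said_yes == expected(value, low, high)

# Example: 4500 is inside [4000, 6000], so a reply of "Yes." passes.
print(grade("Yes.", 4500, 4000, 6000))   # True
print(grade("No.", 4500, 4000, 6000))    # False
```

Running the same graded prompt against both 0314 and 0613 on a schedule would give before/after evidence rather than impressions.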
It may be an issue with the plug-in as well. I randomly selected a plug-in to chat with a Google Sheet, and the result was incorrect.
In short: the plug-in did not retrieve the expected cell and subsequently answered the question incorrectly.
A few hours ago I made a similar post (I also encountered a wild bug once, with random text and code appearing and colours flashing on the screen). More than that, last week I noticed a survey in ChatGPT (bottom left corner, where your account is), and one of the questions was whether you had noticed a degradation in performance and quality… the fun part is that I started noticing exactly that a few days after the survey…
Hi, the problem started this morning, and it involves an HTML table that sits in the middle of a 2400-word text. The test you made is too simple!
Until yesterday evening the model easily answered this kind of request properly. We tested it at least a hundred times, simply because it was one of our test phrases for checking that our software was calling the API model correctly.
Since this morning (with the latest update, GPT 0613) the model can no longer answer, while GPT 0314 can still perform the task properly.
We have just performed a different test of the 0613 model: previously we were passing 2400 words of text per call to 0314, which worked (and still works) very well. Now we have reduced the input to a maximum of 1000 words, and with the shorter text the 0613 model is able to give the right answer…
Apparently this is indirect confirmation that the new model is less able to stay on topic once the number of words (tokens) exceeds a certain threshold.
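If it really is a length threshold, it can be located by bisection instead of guesswork. A rough sketch, assuming success is monotone in length (works short, fails long); `model_ok` is a stand-in predicate, since a real run would call the API and grade the reply, and its 1000-word cutoff is invented purely to mirror the behavior described above.

```python
def model_ok(word_count: int) -> bool:
    # Stub for illustration: pretend the model succeeds only up to 1000 words.
    return word_count <= 1000

def find_threshold(lo: int, hi: int, ok) -> int:
    """Largest length in [lo, hi] for which `ok` still succeeds,
    assuming it works at `lo` and success is monotone in length."""
    while lo < hi:
        mid = (lo + hi + 1) // 2   # bias upward so the loop terminates
        if ok(mid):
            lo = mid               # still works: threshold is at or above mid
        else:
            hi = mid - 1           # fails: threshold is below mid
    return lo

print(find_threshold(100, 2400, model_ok))  # 1000
```

About eleven graded API calls would pin down the exact word count where 0613 starts failing, which is much stronger evidence than "under 1000 works, 2400 doesn't."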
Would you be willing to share the 2400-word prompt with the HTML table that gave the error?
It’s not that easy. Let me check with the data owner.
BTW, the problem appears not only with an HTML table in the middle of other text. In additional tests we found the same issue with 2400 words of plain text (basically a couple of pages of a book, with no HTML table in the middle): the new model was unable to answer a straightforward question about the presence of a specific topic in the text. I confirm that the only way to solve the problem was to reduce the quantity of text to less than 1000 words. With a smaller prompt and text to be analyzed, it works fine.
This is not good, because we planned to use far longer texts (well over 2400 words) in our solution as soon as the 32K engine became available to us. Now none of that looks feasible…
This would be extremely useful information to have. This morning I tried burying a small factoid in a 10k-token text with the 16k gpt-3.5 model, and it found it on all 3 occasions I tried; I also tried the same idea twice with ~3k tokens of text in standard 3.5, and it found the factoid both times.
There is either something very unconventional in your data, e.g. a boundary-marker token sequence accidentally introduced, or there is an issue with your code.
Without your example data I am unable to investigate further. If you can provide your API-calling code (no keys) with any setup functions it relies on, plus the prompt used as input and the output generated, that would be great.
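For anyone who wants to reproduce the buried-factoid test, here is one way to build the haystack and grade the reply. This is a sketch only: the filler sentence and the factoid are made up for illustration, and the actual model call is omitted.

```python
FILLER = "The weather report mentioned light rain over the hills. "
FACTOID = "The secret launch code is 7421. "

def build_haystack(n_sentences: int, needle_pos: int) -> str:
    """Repeat the filler n_sentences times and splice the factoid in
    at position needle_pos, so retrieval depth can be varied."""
    sentences = [FILLER] * n_sentences
    sentences.insert(needle_pos, FACTOID)
    return "".join(sentences)

def grade(model_reply: str) -> bool:
    """Did the model surface the buried number?"""
    return "7421" in model_reply

text = build_haystack(n_sentences=500, needle_pos=250)
prompt = text + "\nWhat is the secret launch code?"
print(grade("The code is 7421."))  # True
print(grade("I'm not sure."))      # False
```

Sweeping `n_sentences` and `needle_pos` and logging the pass rate per model would make the "it forgets long context" claim directly comparable across snapshots.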
Well, I was trying to get ChatGPT-4 to write out the main keywords of a few paragraphs… literally after 2 responses it had a bloody seizure and started giving me COVID-19 keywords… the text I wanted the keywords for was about dogs… and worst of all, this happened in 2 different conversations… seriously gobsmacked.
Counter opinion here:
I have been using a [currently private] plugin to facilitate complex coding tasks. My plugin is capable of writing its own code, doing complex refactors, writing specs and docs, etc.
I’ve been developing it on a nearly daily basis for about 2 months.
I have NOT observed any degradation in gpt4-plugin’s capability. Given that apparently “many people” have observed it, and yet nobody can quite provide before/after examples, I dare say it’s an issue of user psychology. The more you delve into gpt, and the more you use it, the more you will run up against its limitations. When people see this it’s easy to join in the echo chamber of “degraded quality.”
An example of a prompt that didn’t get the answer you wanted is pretty inconsequential unless you can show it behaved differently before.
A single factoid will likely work well – it doesn’t take a lot of the capability of the model. The challenge comes when you start stretching it, giving it 30 rules to adhere to, and it seems to run out of ability to pay attention to many rules at the same time. Or, in the “factoid” case, find 30 different factoids at once.
The “caveman test” is intended to provoke this behavior, and it’s interesting to see the model fail to correctly complete the first 10 sentences.
I have before/after samples, although they are not publicly shareable. This is part of an automated test suite. One obvious example was where it stopped putting code in block quotes, and instead just writes it out as inline text.
I have also (publicly) posted inference runtime graphs that show the runtime going up over time for gpt-4.
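As an illustration of what one such automated check can look like: the code-block regression mentioned above boils down to asserting that a reply wraps code in a triple-backtick fence rather than emitting it as inline text. The regex below is a deliberate simplification for this sketch, not the actual test suite.

```python
import re

TICKS = "`" * 3  # triple backtick, assembled so this snippet stays fence-safe
FENCE = re.compile(
    re.escape(TICKS) + r"[\w+-]*\n.*?\n" + re.escape(TICKS),
    re.DOTALL,  # the fenced body spans multiple lines
)

def has_code_block(reply: str) -> bool:
    """True if the reply contains at least one fenced code block."""
    return bool(FENCE.search(reply))

good = f"Here you go:\n{TICKS}python\nprint('hi')\n{TICKS}"
bad = "Here you go: print('hi')"
print(has_code_block(good))  # True
print(has_code_block(bad))   # False
```

Run against archived responses from different dates, a check like this turns "it stopped formatting code" from an impression into a pass/fail time series.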