I can understand that they want examples, but many of us here and on other forums have stated very clearly what we see: greatly diminished reasoning skills, constant loss of context, the model throwing the task back to you, pointing out its limitations more often, and constantly apologizing for its errors and omissions.
Simple-prompt users won’t see this, but those pushing GPT to its limits are clearly adamant: we can no longer use complex or programmatic prompts, because it constantly “forgets” directives, modifies the content of the work, or gives an incomplete answer.
I was impressed from February to May, and now it’s so bad that there’s little benefit to using it instead of doing the task ourselves. More frustration than result.
If no OpenAI developer experiences and sees all this for themselves, I don’t know what kind of help or example we can give you. So far, OpenAI has not been at all transparent on the issue.
If you give the AI more organized and useful information, it will understand and solve problems better. Unfortunately, I can’t share the actual prompts because they contain confidential information and are being used by my company for our customers.
But I can tell you that we still haven’t found any major issues with GPT-4-0613.
Sometimes, AI doesn’t know what we humans know as common sense, and it can overthink things or make mistakes that we wouldn’t.
So, it’s important to include relevant information in the prompts to help the AI avoid getting things wrong (known as hallucination). However, we should be careful not to give too much information because that can confuse the AI too.
GPT-4 is not an AGI, so we need to manage the amount of information it can handle to ensure it understands and responds correctly.
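To illustrate the point about giving the model only the relevant facts, here is a generic placeholder (nothing like our actual prompts, and the context lines are invented), using the plain Chat Completions call from the pre-1.0 openai Python library:

```python
import openai  # pre-1.0 openai-python style, current when gpt-4-0613 shipped

# Placeholder context: only the facts the model actually needs for this task.
context = (
    "Order #1234 was shipped on 2023-07-01 via DHL.\n"
    "The customer asked for the tracking status in German."
)

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    temperature=0,
    messages=[
        {"role": "system", "content": "Answer using only the facts provided."},
        {"role": "user", "content": context + "\n\nWrite the reply to the customer."},
    ],
)
print(response["choices"][0]["message"]["content"])
```

The point is simply that the useful information is organized up front and nothing extra is added to confuse the model.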
Back in March it was spitting out code like a decent developer. A few days ago I asked it to serve a robots.txt file in my Django project, literally one of the simplest things out there, and it took ten messages or so to produce a “working” solution that is actually pretty bad.
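For reference, the whole task boils down to something like this minimal sketch (the disallow rules here are just placeholders):

```python
# urls.py - serve robots.txt from a simple Django view
from django.http import HttpResponse
from django.urls import path
from django.views.decorators.http import require_GET


@require_GET
def robots_txt(request):
    # Return a static robots.txt body as plain text.
    lines = [
        "User-agent: *",
        "Disallow: /admin/",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")


urlpatterns = [
    path("robots.txt", robots_txt),
]
```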
So yeah, it’s safe to say the model has been lobotomized.
Why are we supposed to work for free for a company that sells a Plus service to access the “most” powerful model while they have decreased its performance in reasoning, coding, and context retention? They need to give us what we are paying for, period. I do understand there may have been moves to save computational resources due to the vast number of clients making server requests: fine, then let some of us pay more for the really powerful model.
Researchers at Stanford University and UC Berkeley recently conducted a study. It was done by Lingjiao Chen, Matei Zaharia, and James Zou.
They found that GPT-4’s performance had degraded over the past several months. Here is an excerpt from the abstract:
We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the “same” LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
Their paper showed that in March 2023, GPT-4 had 52% accuracy in the “Directly Executable” code generation category, but the June 2023 version scored just 10%.
This is BAD
Edit:
Actually, for the coding part, the paper appears to be criticizing the fact that the model now adds markdown around the output more than before, rather than returning the code straight up. To them, this means the code isn’t “directly executable”. This is a strange thing to measure and complain about…
The paper at least covers other aspects, not just coding.
They classed the model generating markdown ``` fences around the code as a failure.
I’m sorry, but that is not a valid reason to claim the code would “not compile”. The model has been trained to produce markdown; the fact that they took the output and copy-pasted it without stripping the markdown does not invalidate the model.
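If they really wanted “directly executable” output, stripping the fence before running the code is trivial. A rough sketch of what I mean (the regex and function name are mine, not the paper’s):

```python
import re

def strip_markdown_fence(text: str) -> str:
    """Strip a surrounding markdown code fence from model output, if present."""
    match = re.search(r"```[\w+-]*\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# Fenced model output becomes plain, directly executable code.
raw = "```python\nprint('this is a test')\n```"
print(strip_markdown_fence(raw))   # -> print('this is a test')
```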
LLMs have never been good at prime numbers, or numbers in general; they are not large math models. It also seems the authors only ran a single example for each test, with a temperature of 0.1, which is not deterministic, and that will lead to errors; there are lots of examples of this throughout the paper.
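A sounder comparison would pin the temperature and repeat each question several times rather than relying on one sample per item. Something like this rough sketch (the snapshot names are the public ones; the prompt set and scoring function are hypothetical):

```python
import openai  # pre-1.0 openai-python style

def run_eval(model: str, prompts: list[str], is_correct, trials: int = 5) -> float:
    """Average accuracy over repeated trials instead of a single sample per question."""
    scores = []
    for prompt in prompts:
        for _ in range(trials):
            resp = openai.ChatCompletion.create(
                model=model,
                temperature=0,  # pin sampling so the two snapshots are comparable
                messages=[{"role": "user", "content": prompt}],
            )
            answer = resp["choices"][0]["message"]["content"]
            scores.append(1.0 if is_correct(prompt, answer) else 0.0)
    return sum(scores) / len(scores)

# Hypothetical usage: run the same question set against both snapshots.
# acc_march = run_eval("gpt-4-0314", prime_prompts, check_prime_answer)
# acc_june  = run_eval("gpt-4-0613", prime_prompts, check_prime_answer)
```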
That is a disingenuous way to phrase this. Three people from Stanford and Berkeley wrote a paper, and the lead author is a student, which means this is a paper written by one student with two faculty advisors.
The paper has some pretty substantial flaws in its methodology, one of which I noted in another thread.
I’ll be reading the rest of the paper in more depth today, but the poor work on the mathematics questions leaves me skeptical.
I completely endorse the claim that the performance of GPT-4 is deteriorating instead of improving as time goes on. Lingjiao Chen, James Y. Zou, and Matei Zaharia conducted a study to measure the performance of GPT-4 over a period of time and found substantial changes, particularly a noticeable decline in its ability to handle certain problem-solving tasks.
source.
They didn’t conduct a study over any period of time longer than it took to ask the chatbot some questions; they just switched between the two available API models of the same generation in the present day. And scroll back five posts; you can see this was just discussed.
Also note the misunderstanding there: markdown is what allows easy copying of executable code within the ChatGPT interface.
I was testing it for about half a day, and I can see the difference between its current responses and past ones. I asked the same question to GPT-4 and Google’s Bard, and Bard gave the better response.
So yes, it’s very true, and I think most of us have noticed the quality deteriorating over time. I am a firm believer that this is due to the censorship at OpenAI, which has caused the product to worsen.
We see the same thing at other companies like Google, where not just YouTube but even regular search has degraded due to political interests. For example, real searches no longer yield what you’re looking for, instead delivering propaganda.
I think Barack Obama would be a great example: if you searched for Barack Obama’s youth, you would previously have seen two infamous videos, which are no longer visible.
The difference between Google and OpenAI is that Google was useful for years, and when it started to degrade, it took years for customers to begin to switch. However, once there was competition, people started to switch quite quickly.
OpenAI’s ChatGPT is fairly new and has already demonstrated its utility, but they began censoring and worsening the product right from the start.
If OpenAI truly wants to make a comeback, they need to remove the censorship routines from their AI. The product needs to regain its usefulness and, over time, it could begin to slowly reintroduce censorship and distribute propaganda.
I’m curious: did you find it sound reasoning that the paper used the fact that 0613 produces ``` markdown fences around code (so you get proper code blocks in ChatGPT) to say the working code within those blocks was invalid? i.e. that because 0613 outputs this

```
print("this is a test")
```

and not this

print("this is a test")

…the output was counted as not directly executable?
The entire document is filled with poor methodology and errors; I’d be happy to go through all of them with you in detail.