GPT-4 is now completely unusable for me, and I am not quite sure what OpenAI’s strategy is. Have they provided an explanation anywhere? I haven’t seen them say a word about the issue publicly, but the performance numbers and the sheer quality of the responses make it incredibly obvious.
The difference is so stark I genuinely feel insulted by their response. Hanlon’s Razor just doesn’t cut it for me anymore.
To quote Peter Welinder on the 13th, exactly one month after the release that seems to have broken GPT-4:
No, we haven’t made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.
Current hypothesis: When you use it more heavily, you start noticing issues you didn’t see before.
I don’t want to call Peter out here personally, but I do think that the moment OpenAI made promises about performance and offered the AI for use in customer products, they took on, if not a legal, then at least a moral obligation to try to fulfill those promises. That is what doing business in good faith is about.
Shortly after that quote, researchers at Stanford University published a paper (linked in this thread) concluding that GPT-4 has seen very severe drops in performance since the June update. It used to be worlds ahead of GPT-3.5 on every single task; now the win count between GPT-4 and GPT-3.5 is almost an exact 50/50 split (although GPT-4 is still clearly better overall).
I must say, it is very hard for me to trust OpenAI with my projects. They are not a non-profit anymore and have seen incredible revenue growth and public support. They were probably one of the most beloved companies in the first half of this year, which is worth everything when selling to consumers, and they managed to ruin that goodwill with many customers: first they shipped a product that revolutionized how tens of millions of people work, then they made it utterly unusable, and then they dismissed it as normal variance between tasks.
Many obvious and easy solutions were discussed in this thread, but they just do not seem to care. The data is clear. I personally have countless prompts that make the drop incredibly obvious; there is no possible way they are unaware of it in their own data, yet they choose to stay silent.
Not sure what to think, guys. I tried out the March version of GPT-4 through their API, but I had never used that snapshot before, so I have no reference point. Did it also see this drop in performance, or could I replace the current GPT-4 with the March API version until GPT-5 or a different LLM arrives?
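For anyone who wants to check the snapshots themselves: here is a minimal sketch, assuming the pre-1.0 `openai` Python package and the dated snapshot names `gpt-4-0314` (March) and `gpt-4-0613` (June) from OpenAI’s model list, that runs the same prompt against both versions for a side-by-side comparison. The prime-number prompt is just a placeholder in the spirit of the Stanford paper’s tasks; substitute your own regression prompts.

```python
# Sketch: run one prompt against the March and June GPT-4 snapshots
# and print both answers side by side.
# Assumes the pre-1.0 `openai` package (pip install "openai<1.0")
# and that your account still has access to the dated snapshots.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Placeholder regression prompt; swap in prompts where you saw the drop.
PROMPT = "Is 17077 a prime number? Think step by step."

for model in ("gpt-4-0314", "gpt-4-0613"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # reduce sampling noise for a fairer comparison
    )
    print(f"--- {model} ---")
    print(resp["choices"][0]["message"]["content"])
```

If `gpt-4-0314` behaves the way GPT-4 used to on your prompts, pinning that dated model name is exactly how you would keep the March behavior; just keep in mind that OpenAI only supports dated snapshots for a limited deprecation window, so check their models page before building on one.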