The latest model, gpt-3.5-turbo-1106, is horrific. I haven't had to rework my prompt format much across our projects (just small experiments plus the transition to messages) since starting with GPT-3, and I've gotten pretty consistent, reasonable results. The initial step up to turbo was actually a fairly big jump in response quality, though each newer version since has felt slightly degraded. I ran a few experiments over the last few days, and an insane number of responses were purely "I'm sorry, I can't do that." in some form or another, with only about 5% actually being relevant.
I rolled back to the previous functional model, gpt-3.5-turbo-0613, but eventually it'll be deprecated and I'll be forced to upgrade. Looking at other reactions to the latest models, I can't help but wonder: wtf is going on over at OpenAI? It can't even be their overzealous content filter; none of the tests I ran had anything even remotely mature, violent, or otherwise sensitive... just basic conversations.