For me, the performance drop was noticeable from day one after Plugins went into beta. There is something wrong with that model.
Below is a sample conversation to generate some Python code:
It isn't even able to generate proper code (which it is supposed to excel at) given the right instructions and a few ToT-like prompts; it basically didn't follow the resolver plans or come up with a better plan of its own.
You can test many use cases and see the GPT-4 plugin model come up short most of the time. I find the reasoning and instruction following of the GPT-4 model without plugins to be better than the one currently used in Plugins.
I don’t know why though.
And I really miss the 3.5 model with plugins; it was genuinely that good at many of the tasks thrown at it. Take a look at the same prompt sequence above using the 3.5 model (not GPT-4) without plugins, which I think fares better. I hope you get what I mean about something being specifically wrong with the plugin model after it was made beta.
You can even tell that 3.5 was smarter: at steps 10 and 11 it had already inferred that they were not applicable.
I have tried all the usual prompting techniques with the GPT-4 plugin model, CoT, ToT, AutoGPT-style, and pseudo-code prompts, and it fails miserably at following the instructions fully, and sometimes even at coming up with a proper, cohesive plan.
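To make the terminology concrete, here is a rough sketch of the kind of structured, ToT-style prompt I mean. This is a minimal, hypothetical example: the task, the plan wording, and the use of the API client are placeholders (the plugin model only runs inside ChatGPT, so this just shows the prompt shape, not my exact setup):

```python
# Minimal sketch of a ToT-style (plan -> critique -> pick -> implement) prompt,
# using the legacy openai Python client. The task and wording are placeholders.
import openai

openai.api_key = "sk-..."  # your API key

system = (
    "You are a senior Python developer. Before writing any code, propose three "
    "candidate plans, critique each one briefly, pick the best, then implement it."
)
user = (
    "Task: write a function that parses a CSV of timestamps and returns the "
    "longest gap between consecutive entries.\n"
    "Follow your selected plan step by step and do not skip steps."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ],
    temperature=0,
)
print(response.choices[0].message["content"])
```

With prompts structured like this, the plugin model still tends to abandon the plan it picked partway through, which is the behavior I'm describing.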
[Edit]
Here is the GPT-4 model's response without plugins; the 3.5 response was still better.
I hope that makes it clear, but you can test many tasks in the same ballpark.