GPT-4 is still much better for our complex tasks, the ones that require careful reading and strict prompt following. GPT-4-turbo was OK for simple tasks or “conversation”, but we use GPT-3.5-turbo for that.
GPT-4o is much worse than GPT-4, and even GPT-4-turbo, for our uses. We switched to GPT-4o anyway because of the price, and our scripts filter out the terrible outputs we sometimes receive. Some of those outputs are random strings that have nothing to do with our prompts; once, 4o gave us the specs of a Boeing plane out of nowhere.
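For anyone wondering what filtering off-topic outputs can look like, here is a minimal sketch. It is not our actual script, just one plausible heuristic: reject a reply unless it shares a few longer words with the prompt. The function names, the 4-character cutoff, and the `min_shared` threshold are all assumptions made for illustration.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens longer than 3 characters (crude stopword filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def looks_on_topic(prompt: str, output: str, min_shared: int = 2) -> bool:
    """Hypothetical heuristic: the reply must reuse at least `min_shared` content words
    from the prompt. Empty replies are always rejected."""
    shared = content_words(prompt) & content_words(output)
    return len(shared) >= min_shared


# Illustrative strings, not real model outputs.
prompt = "Summarize the refund policy for damaged laptop orders."
good = "Damaged laptop orders qualify for a full refund within 30 days."
bad = "The Boeing 737-800 has a wingspan of about 35.8 meters."

kept = [reply for reply in (good, bad) if looks_on_topic(prompt, reply)]
```

A word-overlap check like this only catches replies that are completely unrelated to the prompt. It will not catch an answer that is on topic but wrong.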
It's frustrating to see leaps forward in image reading (4o is GREAT at that) alongside large steps back in complex analysis and tasks.
One of our simplest benchmarks is whether a model can answer a multiple-choice question of the form “All of the following are TRUE, EXCEPT:” on a semi-complex topic.
GPT-4 fails often, but the rest of the models fail every time.
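A benchmark like this is easy to script. The sketch below is a rough, hypothetical harness, not our real one: `ExceptQuestion`, `build_prompt`, `extract_letter`, and `accuracy` are made-up names, the sample question is only illustrative, and `ask` stands in for whatever function sends a prompt to your model and returns its text reply.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExceptQuestion:
    stem: str                 # short topic intro
    options: dict[str, str]   # letter -> statement
    answer: str               # letter of the one FALSE statement


def build_prompt(q: ExceptQuestion) -> str:
    """Render the question in the 'TRUE, EXCEPT' format."""
    lines = [f"{q.stem} All of the following are TRUE, EXCEPT:"]
    lines += [f"{letter}. {text}" for letter, text in q.options.items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_letter(reply: str, valid: str) -> Optional[str]:
    """Return the first standalone capital letter from `valid` found in the reply.
    Crude heuristic: a reply that starts with the article 'A' would be misread."""
    match = re.search(rf"\b([{re.escape(valid)}])\b", reply)
    return match.group(1) if match else None


def accuracy(questions: list[ExceptQuestion], ask: Callable[[str], str]) -> float:
    """Fraction of questions where the model picks the false statement."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        valid = "".join(q.options)
        if extract_letter(ask(build_prompt(q)), valid) == q.answer:
            correct += 1
    return correct / len(questions)


# Illustrative question; B is the false statement.
SAMPLE = ExceptQuestion(
    stem="Consider the planets of our solar system.",
    options={
        "A": "Mercury is the closest planet to the Sun.",
        "B": "Venus has more moons than Jupiter.",
        "C": "Mars has two known moons.",
        "D": "Jupiter is the largest planet.",
    },
    answer="B",
)


def always_b(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return "B"
```

To compare models, pass a different `ask` function per model and run `accuracy` over the same question list. Asking each question several times helps separate "fails often" from "fails every time".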