Some initial tests gave mixed results. Will try further. Question: did you try fine-tuning?
You tried 3.5? Not sure what you mean by the part you highlighted.
Yes, I tried. I gave some examples in the prompt and asked for a similar output. Got good results, but sometimes not so good. It was an initial test; I need to test more for a proper evaluation.
Finetuning: https://platform.openai.com/docs/guides/fine-tuning
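If you do go the fine-tuning route, the flow via the official Python SDK looks roughly like the sketch below (the file name and base model here are just placeholders, not something from this thread):

```python
# Minimal fine-tuning sketch with the official OpenAI Python SDK (v1+).
# Assumes training_data.jsonl exists and follows the chat fine-tuning format
# ({"messages": [{"role": ..., "content": ...}, ...]} per line).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model; check the docs for what is currently supported
)

print(job.id, job.status)
```

Once the job finishes, you get a fine-tuned model name you can pass to the normal chat completions endpoint.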
This could explain the lower cost and faster performance compared to 4-turbo.
The gpt-4o model produces worse outputs across the board and is hilariously stupider than Claude. I have Claude autocoding itself into an ABM, and GPT-4o can't figure out how to get past its own tool template.
So don’t use it. I don’t see what the problem is.
Welp. Dunno what to tell ya.
Interesting… I took an excerpt of English text from the OpenAI page just now and opened three tabs with ChatGPT, each with a different model.
To my surprise, the tab with the 4o model produced exactly the same translation as the tab with the 3.5 model, and only the 4t model produced a different translation (and a better one than the other two).
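A rough sketch of how the same comparison could be scripted against the API instead of browser tabs (the model names and prompt are illustrative, not exactly what I used):

```python
# Run the same translation prompt against several models and print each output side by side.
# Model names are examples; swap in whatever is available on your account.
from openai import OpenAI

client = OpenAI()
prompt = "Translate the following English text into <target language>:\n\n<your excerpt here>"

for model in ["gpt-3.5-turbo", "gpt-4-turbo", "gpt-4o"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variance for a fairer comparison
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```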
More like a gpt-3.5o, then? One test or a few observations wouldn't prove anything, but if true, the implications are profound. It wouldn't be half-price; it would effectively be 10x more expensive at the same speed. More importantly, OpenAI would be being dishonest, which means humanity is in danger (and, being an OpenAI fan, I really hope that's not the case).
@ 2:15
Personally, I've gotten pretty good results with gpt-4o vision, and in another scenario where it had to produce a text report from data. The previous model, gpt-4, didn't pay attention to details in the request and also added some dumb jokes. I don't get that with gpt-4o, so I'm sticking with it.
Indeed, all the demos look amazing. That's just not my experience so far (though I must admit I have interacted with gpt-4o only via the API and Replit's AI chat engine).
You have described a situation I myself have experienced with both 4o and 4, i.e., the same prompt is understood by one model and massively misunderstood by the other model. I did not experience such issues before the launch of 4o.
4o is very weak on complex issues. It just cannot combine simpler concepts together. We reverted to 4-turbo, which feels like THE new version after using 4o.
You get what you pay for. It’s not cheaper for no reason.
I have to agree with the OP, gpt-4o struggles where gpt-4-turbo succeeds.
It's completely failing on a completions-based categorisation task for me, where turbo is very good.
(yes, I’m aware that’s a non-standard approach to categorisation, but I actually find embeddings frustratingly poor at some types of categorisation problems)
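To give an idea of what I mean by a completions-based approach, here's a minimal sketch (the labels, prompt wording, and helper function are made up for illustration, not my actual task):

```python
# Classify a piece of text by asking the chat model to pick one label from a fixed set,
# instead of using embeddings + nearest neighbour. Labels here are illustrative only.
from openai import OpenAI

client = OpenAI()

LABELS = ["billing", "technical issue", "feature request", "other"]

def categorise(text: str, model: str = "gpt-4-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a classifier. Reply with exactly one of these labels "
                    "and nothing else: " + ", ".join(LABELS)
                ),
            },
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(categorise("I was charged twice for my subscription last month."))
```

With this kind of setup, turbo reliably returns one of the allowed labels for me, while 4o does not.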
Shame!
Look at the text evaluation graph on the OpenAI website: https://openai.com/index/hello-gpt-4o/
Scroll down the page to find the benchmark results.
This proves that GPT-4o is the best.
Real-world experience is not a benchmark.
I think you’re responding to a troll.