I see a lot of people here talking about how the new gpt-4-turbo models don’t perform as well as the old gpt-4 models, and I’ve had some of that feeling myself. But I recently went looking for benchmarks (there are surprisingly few), and at least on the LMSys Chatbot Arena, the gpt-4-turbo models outperform the old gpt-4 models by a good chunk, including on the coding subset of questions. It’s a crowdsourced preference evaluation, so it’s not perfect, but it runs contrary to the occasional comments I hear.
@OpenAI I’d personally appreciate it a lot if you took the initiative to submit your turbo models to benchmarks like GAIA, MMLU, HumanEval, etc. and published the results. I’d imagine that’s a pretty common desire among developers trying to decide between models. Looking at the other comments here: if you have good, reproducible results on these, it would probably be worth releasing them, given that the speculation around the topic mostly ends up concluding that turbo is worse.
Here’s the LMSys leaderboard. I think it’s the only benchmark I found that evaluates the gpt-4-turbo models (if anyone has found others, it would be awesome if you could share).
I’ve determined that I only trust those benchmarks that say what I personally believe, and in the face of overwhelming evidence I will rationalize about why I’m still right.
This is my current rationalization: GPT-4 and turbo have different strengths and weaknesses. In terms of shallow conversational ability, I don’t doubt that the newer models may perform better. But repeated (personal) evaluation consistently shows that newer models (can’t speak to the 04 model yet) struggle with certain tasks.
The question is: do you want a conversationalist, or a workhorse?
A significant oversight of these benchmarks is that they present the evaluator with just a chatbot, and responses that stimulate the user, such as repeating their own input back to them, are the ones that get chosen when deciding between two outputs, usually in a single shot.
What is not evaluated is steerability, the performance when instructed and preprogrammed by a developer on advanced tasks, and the ability to maintain context awareness across a new goal, injected information, and long inputs. That’s where you’ll find the performance slashed: in actually DOING.
Not my experience at all for anything I’ve thrown at gpt-4-turbo.
I don’t trust the benchmarks. Mostly because I don’t understand their exact mechanics, and also partially because my personal experience is just that, my experience with it.
gpt-4-turbo is overly verbose, consistently ignores key logic, and, I have to say, for coding it’s almost unusable. It doesn’t take system messages as well either. Writing is better than coding, but still worse than gpt-4. It reminds me of 3.5 turbo.
So whatever the benchmarks say, all other things being equal in my setup (i.e., my personal benchmarking), gpt-4-turbo is worse and there is no reason for me to switch off gpt-4. The only nasty downside of staying on gpt-4 is the limited context window. But hey, that’s just me.
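If anyone wants to run the same kind of side-by-side check, here’s a minimal sketch of what I mean by personal benchmarking, assuming the current OpenAI Python SDK, an `OPENAI_API_KEY` in the environment, and the `gpt-4` / `gpt-4-turbo-preview` model names (substitute whichever snapshots you’re actually comparing): same prompt, same system message, same temperature, only the model string changes.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One of my own test prompts; swap in whatever task you actually care about.
PROMPT = "Write a SQL query that returns the top 5 customers by total order value."

for model in ("gpt-4", "gpt-4-turbo-preview"):  # example model names, adjust to your snapshots
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep sampling as deterministic as possible for a fair comparison
        messages=[
            {"role": "system", "content": "You are a senior engineer. Be precise."},
            {"role": "user", "content": PROMPT},
        ],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```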
Do you want an AI that chats it up with post-hoc justification that doesn’t add any real reflection on the answer, or do you want an AI where the first tokens out are already correct?
I’d be OK with dropping 0314 for Opus. It’s probably even cheaper in the long run, but I’m Azure-sponsored and not AWS-sponsored, so it’s not a decision I can make at the moment.
I hope OpenAI can get their 'ish together and produce high performance models, but it’s hard to say whether they actually intend to.
I mean, for coding, probably the first, but for general Q&A use as a consumer product, the second, which is what they’re going for, I think. What do you think about strict prompting on the new turbo models to be concise?
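Something like this is what I had in mind, just a rough sketch (assuming the current OpenAI Python SDK and the `gpt-4-turbo-preview` model name, which you’d swap for your own snapshot; I haven’t verified it fully tames the verbosity): a blunt system message plus a hard `max_tokens` cap as a backstop.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A deliberately strict system message to push the turbo model toward brevity.
CONCISE_SYSTEM = (
    "Answer with code only. No preamble, do not restate the question, "
    "and no explanation unless explicitly asked for one."
)

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # example model name; use whatever snapshot you're on
    temperature=0,
    max_tokens=400,  # hard cap as a backstop against rambling answers
    messages=[
        {"role": "system", "content": CONCISE_SYSTEM},
        {"role": "user", "content": "Write a Python function that deduplicates a list while preserving order."},
    ],
)
print(response.choices[0].message.content)
```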
And I’ll say that’s still true. But for another example, I attempted to have Opus write some Django backend/database stuff for basic Google SSO, and if I had used its code, it would have borked my entire backend. I switched to gpt-4 and the model got it right on the second go.
At first I thought it was that gpt-4 was more “logical”, but I don’t think that’s the right way of putting it. It’s odd. Maybe it’s just a difference in training data. I still use Opus for most things.
Here’s a benchmark on Opus versus gpt-4 and 3.5. Nothing released on gpt-4-turbo, so it’s still sort of a black box. I’m really hoping they release the turbo benchmarks, even if they only do it later, after another model release that outperforms it.
For a (directed) workhorse, it fails spectacularly at tasks I need.
For a conversationalist, it performs superbly, with some caveats: on some topics it is outright stubborn and unyielding, to the point of becoming a total failure.
Currently I only use ChatGPT for around 1-5% of my work. I’ve already unsubscribed from everything and won’t pay for API access anymore (on my other accounts, to avoid any flagging if they decide to ban this account), and I’ve already migrated to other systems.
A combination with other AGI systems would probably be more flexible, faster, and much cheaper than fully using OpenAI products.
Be careful with Claude 3 Opus though; they can ban you over certain keywords. But it performs much better than ChatGPT-4 currently (even compared with the newest GPT-4 alternative).