I see a lot of people here talking about how the new gpt-4-turbo models don’t perform as well as the old gpt-4 models, and I’ve had some of that feeling myself. But I recently went looking for benchmarks (there are surprisingly few), and at least on the LMSys Chatbot Arena, the gpt-4-turbo models outperform the old gpt-4 models by a good chunk, including on the coding subset of questions. It’s a crowdsourced preference evaluation, so it’s not perfect, but it runs contrary to the occasional comments I hear.
@OpenAI I’d personally appreciate it a lot if you took the initiative to submit your turbo models to benchmarks like GAIA, MMLU, HumanEval, etc. and published the results. I’d imagine that’s a pretty common desire among developers trying to decide between models. Looking at the other comments here: if you have good, reproducible results on these, it would probably be worth releasing them, given that the speculation around the topic mostly ends up concluding that turbo is worse.
Here’s the LMSys leaderboard. I think it’s the only benchmark I found that evaluates the gpt-4-turbo models (if anyone has found others, it would be awesome if you could share).
I’ve determined that I only trust those benchmarks that say what I personally believe, and in the face of overwhelming evidence I will rationalize about why I’m still right.
This is my current rationalization: GPT-4 and turbo have different strengths and weaknesses. In terms of shallow conversational ability, I don’t doubt that the newer models may perform better. But repeated (personal) evaluation consistently shows that newer models (can’t speak to the 04 model yet) struggle with certain tasks.
The question is: do you want a conversationalist, or a workhorse?
A significant oversight of these benchmarks is that they present the evaluator with just a chatbot, and responses that stimulate the user, such as repeating their own input back to them, are the ones that get chosen when deciding between two outputs, usually in a single shot.
What is not evaluated is steerability, the performance when instructed and preprogrammed by a developer on advanced tasks, and the ability to maintain context awareness across a new goal, injected information, and long inputs. That’s where you’ll find the performance slashed: in actually DOING.
Not my experience at all for anything I’ve thrown at gpt-4-turbo.
I don’t trust the benchmarks. Mostly because I don’t understand their exact mechanics, and also partially because my personal experience is just that, my experience with it.
gpt-4-turbo is overly verbose, consistently ignores key logic, and, I have to say, for coding it’s almost unusable. It doesn’t take system messages as well either. Writing is better than coding, but still worse than gpt-4. It reminds me of 3.5 turbo.
So whatever the benchmarks say, all other things being equal in my setup (i.e., my personal benchmarking), gpt-4-turbo is worse and there is no reason for me to switch off gpt-4. The only nasty downside of staying on gpt-4 is the limited context window. But hey, that’s just me.
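If anyone wants to run the same kind of side-by-side check, here’s a minimal sketch of what I mean by personal benchmarking, assuming the current OpenAI Python SDK, an `OPENAI_API_KEY` in the environment, and the `gpt-4` / `gpt-4-turbo-preview` model names (substitute whichever snapshots you’re actually comparing): same prompt, same system message, same temperature, only the model string changes.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One of my own test prompts; swap in whatever task you actually care about.
PROMPT = "Write a SQL query that returns the top 5 customers by total order value."

for model in ("gpt-4", "gpt-4-turbo-preview"):  # example model names, adjust to your snapshots
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep sampling as deterministic as possible for a fair comparison
        messages=[
            {"role": "system", "content": "You are a senior engineer. Be precise."},
            {"role": "user", "content": PROMPT},
        ],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```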
Do you want an AI that chats it up with post-hoc justification that doesn’t add any real reflection on the answer, or do you want an AI where the first tokens out are already correct?
I’d be OK with dropping 0314 for Opus. It’s probably even cheaper in the long run, but I’m Azure-sponsored and not AWS-sponsored, so it’s not a decision I can make at the moment.
I hope OpenAI can get their 'ish together and produce high performance models, but it’s hard to say whether they actually intend to.
I mean, for coding, probably the first, but for general Q&A use as a consumer product, the second, which is what they’re going for, I think. What do you think about strict prompting on the new turbo models to be concise?
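Something like this is what I had in mind, just a rough sketch (assuming the current OpenAI Python SDK and the `gpt-4-turbo-preview` model name, which you’d swap for your own snapshot; I haven’t verified it fully tames the verbosity): a blunt system message plus a hard `max_tokens` cap as a backstop.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A deliberately strict system message to push the turbo model toward brevity.
CONCISE_SYSTEM = (
    "Answer with code only. No preamble, do not restate the question, "
    "and no explanation unless explicitly asked for one."
)

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # example model name; use whatever snapshot you're on
    temperature=0,
    max_tokens=400,  # hard cap as a backstop against rambling answers
    messages=[
        {"role": "system", "content": CONCISE_SYSTEM},
        {"role": "user", "content": "Write a Python function that deduplicates a list while preserving order."},
    ],
)
print(response.choices[0].message.content)
```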
And I’ll say that’s still true. But for another example, I attempted to have Opus write some Django backend/database stuff for basic Google SSO, and if I had used its code, it would have borked my entire backend. I switched to gpt-4 and the model got it right on the second go.
At first I thought it was that gpt-4 was more “logical”, but I don’t think that’s the right way of putting it. It’s odd. Maybe it’s just a difference in training data. I still use Opus for most things.
Here’s a benchmark on Opus versus gpt-4 and 3.5. Nothing released on gpt-4-turbo, so it’s still sort of a black box. I’m really hoping they release the turbo benchmarks, even if they only do it later, after another model release that outperforms it.
For a (directed) workhorse, it fails spectacularly at tasks I need.
For a conversationalist, it performs superbly, with some caveats: on some topics it is outright stubborn and unyielding, to the point of becoming a total failure.
Currently I only use ChatGPT for around 1-5% of my work. I’ve already unsubscribed from everything and won’t pay for API access anymore (on my other accounts, to avoid any flagging if they decide to ban this account), and I’ve already migrated to other systems.
A combination with other AGI systems would probably be more flexible, faster, and much cheaper than fully using OpenAI products.
Be careful with Claude 3 Opus though; they can ban you over certain keywords. But it performs much better than ChatGPT-4 currently (even compared with the newest GPT-4 alternative).