GPT-4 comparison to Anthropic Opus on benchmarks

In a comparative assessment of Claude 3 Opus and GPT-4, Claude 3 Opus demonstrates superior performance across a spectrum of knowledge and reasoning benchmarks. Its advantage is especially notable on complex reasoning and coding tasks, suggesting it is better suited for applications requiring advanced cognitive processing.

Here are the individual comparisons in bullet points:

  • Undergraduate Level Knowledge: Claude 3 Opus scores 86.8%, slightly ahead of GPT-4’s 86.4%.
  • Graduate Level Reasoning: Claude 3 Opus has a significant lead with 50.4%, compared to GPT-4’s 35.7%.
  • Grade School Math: Claude 3 Opus achieves 95.0%, surpassing GPT-4’s 92.0%.
  • Multilingual Math: Claude 3 Opus leads with 90.7%, compared to GPT-4’s 74.5%.
  • Coding (HumanEval): Claude 3 Opus scores 84.9%, notably higher than GPT-4’s 67.0%.
  • Reasoning Over Text: Claude 3 Opus at 83.1% is ahead of GPT-4’s 80.9%.
  • Mixed Evaluations: Claude 3 Opus outperforms with 86.8%, in contrast to GPT-4’s 83.1%.
  • Knowledge Q&A: Claude 3 Opus marginally leads with 96.4%, while GPT-4 is close behind at 96.3%.
  • Common Knowledge: Claude 3 Opus scores 95.4%, slightly better than GPT-4’s 95.3%.
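
Since the coding numbers above come from HumanEval, it may help to recall how its pass@k metric is defined. Below is a minimal sketch of the standard unbiased estimator from the original HumanEval paper, assuming `n` samples are drawn per problem and `c` of them pass the unit tests (the headline scores quoted in this thread are pass@1):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, of which
    c are correct, passes the unit tests."""
    if n - c < k:
        # fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 generations, 1 correct: pass@1 = 0.5
print(pass_at_k(2, 1, 1))
```

The per-problem values are then averaged over all 164 HumanEval problems to produce a single percentage like the 84.9% quoted above.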

The overall results indicate that while GPT-4 is a highly competent model, Claude 3 Opus shows enhanced capabilities in dealing with a variety of problem-solving and knowledge-based tasks, possibly making it a more potent tool for tackling sophisticated AI challenges.

I would personally love to see how gpt-4-turbo and subsequent models measure up on these benchmarks as well. Hopefully they even outperform the Opus model.

With the release of GPT-4o, OpenAI also published benchmark evals for both old and new models, and those results upend the comparison above. According to OpenAI's simple-evals benchmark results, the new model surpasses Anthropic's Opus across the board on these repeatable, open-source evaluations. Notably, GPT-4o outperforms Opus on HumanEval 91% to 85%, with higher throughput, lower latency, and a fraction of the price. The LMSys leaderboards also show a strong preference for the new model above all others.


This has been my personal experience as well. Both Gemini (I went with the free trial for Advanced) and GPT-4 struggled to even comprehend very simple code generation requests. I was at the brink when I decided to give Claude 3 Sonnet a try, and it blew these guys out of the water; I had my "MVP". I am actually signing up for Claude 3 Opus as I write this. I might come back and update this after I have used Opus for a bit for code generation on my project (Flask and Bootstrap 5, nothing fancy).


Claude Opus is several times better than ChatGPT Plus. I had ChatGPT Plus for a year with three paid accounts and one team account with two users, and I cancelled it because it was a waste of time: it doesn't generate long-form code with advanced logic. Claude Opus is far more intelligent and highly helpful. Cancel ChatGPT Plus, it's a waste of money. They built ChatGPT to harvest data from user inputs and to help corporate companies like Microsoft, not people like you.


It depends on the task, I'd say. GPT-4 is still the best at reasoning tasks, while Opus can be better at coding. We collected all the data and ran small experiments to compare the two; you can read our findings here: Claude 3 Opus vs GPT-4: Task Specific Analysis


Finally, we had a benchmark release from OpenAI, and GPT-4-turbo and GPT-4o both do pretty well:

openai/simple-evals

Yes, Opus is better for coding because it doesn't try to save output tokens all the time. This is why I cancelled GPT.

On this topic, according to the simple-evals OpenAI released, GPT-4o currently surpasses Anthropic Opus on the HumanEval benchmark: 91% to 85%.

In your opinion, which is better for writing web articles: GPT-4o or Claude 3 Opus? Or is there another model you would recommend?