GPT-4 comparison to Anthropic Opus on benchmarks

In a comparative assessment of Claude 3 Opus and GPT-4’s capabilities, Claude 3 Opus generally demonstrates superior performance across a spectrum of tasks that test for knowledge and reasoning abilities. Claude 3 Opus consistently outperforms GPT-4, with an especially notable advantage in complex reasoning and coding tasks, suggesting it is better suited for applications requiring advanced cognitive processing.

Here are the individual comparisons in bullet points:

  • Undergraduate Level Knowledge: Claude 3 Opus scores 86.8%, slightly ahead of GPT-4’s 86.4%.
  • Graduate Level Reasoning: Claude 3 Opus has a significant lead with 50.4%, compared to GPT-4’s 35.7%.
  • Grade School Math: Claude 3 Opus achieves 95.0%, surpassing GPT-4’s 92.0%.
  • Multilingual Math: Claude 3 Opus leads with 90.7%, compared to GPT-4’s 74.5%.
  • Coding (HumanEval): Claude 3 Opus scores 84.9%, notably higher than GPT-4’s 67.0%.
  • Reasoning Over Text: Claude 3 Opus at 83.1% is ahead of GPT-4’s 80.9%.
  • Mixed Evaluations: Claude 3 Opus outperforms with 86.8%, in contrast to GPT-4’s 83.1%.
  • Knowledge Q&A: Claude 3 Opus marginally leads with 96.4%, while GPT-4 is close behind at 96.3%.
  • Common Knowledge: Claude 3 Opus scores 95.4%, slightly better than GPT-4’s 95.3%.
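
To make the size of each lead easier to see, here is a minimal Python sketch that computes the percentage-point gap for every benchmark and sorts the results. The scores are copied from the bullet list above; the `scores` table and the `point_gaps` helper are illustrative names only, not part of any published evaluation tooling.

```python
# Scores (percent) copied from the bullet list above: (Claude 3 Opus, GPT-4).
scores = {
    "Undergraduate Level Knowledge": (86.8, 86.4),
    "Graduate Level Reasoning": (50.4, 35.7),
    "Grade School Math": (95.0, 92.0),
    "Multilingual Math": (90.7, 74.5),
    "Coding (HumanEval)": (84.9, 67.0),
    "Reasoning Over Text": (83.1, 80.9),
    "Mixed Evaluations": (86.8, 83.1),
    "Knowledge Q&A": (96.4, 96.3),
    "Common Knowledge": (95.4, 95.3),
}


def point_gaps(table):
    """Return (benchmark, gap) pairs sorted by gap, largest first.

    The gap is Opus minus GPT-4 in percentage points, rounded to one
    decimal place to hide floating-point noise.
    """
    gaps = [(name, round(opus - gpt4, 1)) for name, (opus, gpt4) in table.items()]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


gaps = point_gaps(scores)
for name, gap in gaps:
    print(f"{name}: {gap:+.1f} points")
```

The three largest gaps are coding (HumanEval), multilingual math, and graduate-level reasoning, while the knowledge benchmarks differ by less than half a point.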

The overall results indicate that while GPT-4 is a highly competent model, Claude 3 Opus shows enhanced capabilities in dealing with a variety of problem-solving and knowledge-based tasks, possibly making it a more potent tool for tackling sophisticated AI challenges.

I would personally love to see how gpt-4-turbo and the models that follow measure up on these benchmarks as well. Hopefully they even outperform the Opus model 👍🏻.

With the release of GPT-4o, OpenAI also released benchmark evals on old and new models (OpenAI simple-evals benchmark results), and the results above have been upended by the new model. On these repeatable open-source evaluations, the new model seemingly surpasses Anthropic's Opus across the board. Notably, GPT-4o outperforms Opus on HumanEval, 91% to 85%, with higher throughput, lower latency, and a fraction of the price. The LMSys leaderboards also show a strong preference for the new model above all others.
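
A six-point gap near the top of the scale is larger than it looks, because what shrinks is the failure rate. As a rough sketch using only the rounded 91% and 85% figures quoted above (the `relative_error_reduction` helper is a hypothetical name, not something from simple-evals):

```python
def relative_error_reduction(new_score, old_score):
    """Fraction by which the failure rate drops between two scores.

    Both scores are percentages; the failure rate is 100 minus the score.
    """
    old_error = 100.0 - old_score
    new_error = 100.0 - new_score
    return (old_error - new_error) / old_error


# Rounded HumanEval figures from the post: GPT-4o 91%, Opus 85%.
reduction = relative_error_reduction(91.0, 85.0)
print(f"{reduction:.0%}")
```

With these rounded inputs, the failure rate goes from 15% to 9%, which is a 40% relative reduction. The exact value would shift with the unrounded scores.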

This has been my personal experience as well. Both Gemini (I went with the free trial for Advanced) and GPT-4 struggled to comprehend even very simple code generation requests. I was at the brink when I decided to give Claude 3 Sonnet a try, and it blew these guys out of the water and I had my “MVP”. I am actually signing up for Claude 3 Opus as I am writing this. I might come back and update this after I have used Opus for a bit for code generation for my project (Flask and Bootstrap 5, nothing fancy).

Claude Opus is 4 times better than ChatGPT Plus. I had ChatGPT Plus for one year, with three paid accounts and one team account with two users. I cancelled it because ChatGPT Plus is a waste of time: it doesn’t generate any long-form code with advanced logic. Claude Opus is 10 times better, more intelligent, and highly helpful. Cancel ChatGPT Plus; it’s a waste of money. They built ChatGPT to collect data from user inputs and to help corporations like Microsoft, not for people like you.


It depends on the task, I’d say. GPT-4 is still the best at reasoning tasks, while Opus can be better at coding. We collected all the data and ran some small experiments to compare the two; you can read our findings here: Claude 3 Opus vs GPT-4: Task Specific Analysis


Finally, we had a release from OpenAI on benchmarks, and GPT-4-turbo and GPT-4o both do pretty well on them:

openai/simple-evals

Yes, Opus is better for coding because it doesn’t try to save output tokens all the time. This is why I cancelled GPT.

On this topic, according to the simple-evals results OpenAI released, GPT-4o currently surpasses Anthropic's Opus on the HumanEval benchmark: 91% to 85%.

In your opinion, which is better for creating web articles: GPT-4o or Claude 3 Opus? Or is there another model you would recommend?