It seems odd that o3 wasn’t included in the evaluation. The only reference to it is:
In August 2024, when OpenAI introduced SWE-Bench Verified, GPT-4o achieved a 33% score. Currently, their o3 reasoning model sets a new standard with a 72% score (OpenAI, 2024b), emphasizing the importance of comprehensive evaluations that mirror the intricacies of actual software engineering.
Why do you think they omitted o3 from their own benchmarks? Could it be because its performance was notably high, or notably low?
It’s really interesting to me that the paper shows Claude 3.5 Sonnet as the best model. While it’s incredibly common for small open-source models to point out how close they come to the performance of the big closed models, it’s not often that a major AI player releases something showing a competitor beating them soundly, especially with a model that is now eight months old.
It may be time to take another look at Anthropic soon!
You can find more details in the paper and their GitHub repository.