OpenAI releases new coding benchmark SWE-Lancer showing 3.5 Sonnet beating o1

It seems odd that o3 wasn’t included. The only reference is:

In August 2024, when OpenAI introduced SWE-Bench Verified, GPT-4o achieved a 33% score. Currently, their o3 reasoning model sets a new standard with a 72% score (OpenAI, 2024b), emphasizing the importance of comprehensive evaluations that mirror the intricacies of actual software engineering.

Why do you think they omitted o3 benchmarks? Could it be because its performance is significantly higher or lower?

It’s really interesting to me that the paper shows Claude 3.5 Sonnet as the best model. While it’s incredibly common for small open source models to point out how close they come to the performance of the big closed models, it’s not often that a major AI player releases something showing a competitor beating them soundly, especially with a model that is now 8 months old.

It may be time to take another look at Anthropic soon!

You can find more details in the paper and their GitHub repository.


In my experience with AI, which I use daily and extensively, I have to say that Claude 3.5 Sonnet is truly remarkable and stands out from the rest.
Its ability not only to produce high-quality code but also to understand what the user is asking for based on a prompt is incredible and the best in its league.

The major problem with most AI models isn’t that they’re incapable of generating correct output; their issues lie in understanding the user’s prompt.

Most people don’t have a doctorate in prompting, so it is key for state-of-the-art AIs to truly understand what a user wants from a simple prompt.
Claude takes care of things the user hasn’t mentioned.
With models like 4o, you need to specify every little thing; otherwise, the output is mediocre at best.

I hope this realisation will lead to o3 and future models being superior to Claude. :hugs:
