It seems odd that o3 wasn’t included in the evaluation. The only reference to it is:
In August 2024, when OpenAI introduced SWE-Bench Verified, GPT-4o achieved a 33% score. Currently, their o3 reasoning model sets a new standard with a 72% score (OpenAI, 2024b), emphasizing the importance of comprehensive evaluations that mirror the intricacies of actual software engineering.
Why do you think they omitted o3 from their own benchmarks? Could it be because its performance was notably high, or notably low?
It’s really interesting to me that the paper shows Claude 3.5 Sonnet as the best model. While it’s incredibly common for small open-source models to point out how close they come to the performance of the big closed models, it’s not often that a major AI player releases something showing a competitor beating them soundly, especially with a model that is now eight months old.
It may be time to take another look at Anthropic soon!
You can find more details in the paper and their GitHub repository.