All the current code evals are too easy to overfit, imho, though I’d love to hear about any that are not.
I imagine the current SWE-bench set will suffer from overfitting, but fortunately it's pretty easy to select a new set of tasks. That also means training on the eval would be a rather bad idea: it would get exposed quickly and could cause reputational damage.
Totally agree. An interesting approach to this would be to create a new set of questions for an existing benchmark and compare performance between the two sets; the model should perform worse on the new one if it's been contaminated by the original benchmark being available on the internet.
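Something like this, as a rough sketch (the split names and numbers are made up, just to illustrate the comparison):

```python
# Toy version of the "old questions vs. fresh questions" contamination check.
# Assumes you already have per-task pass/fail results for one model on both
# splits; `original_results` / `fresh_results` are placeholder names.

def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks the model solved."""
    return sum(results) / len(results) if results else 0.0

def contamination_gap(original_results: list[bool],
                      fresh_results: list[bool]) -> float:
    """Drop in pass rate going from the public (possibly memorized) set
    to the freshly written set. A large positive gap is a red flag."""
    return pass_rate(original_results) - pass_rate(fresh_results)

# e.g. 70% on the public set vs. 45% on new questions -> 25-point gap
original = [True] * 70 + [False] * 30
fresh = [True] * 45 + [False] * 55
print(f"gap: {contamination_gap(original, fresh):+.0%}")
```

You'd still want to control for the new questions simply being harder, but a big drop on an otherwise comparable set is a strong hint.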
The interesting thing here is that we know some of the models in that study share the same training dataset, as with the latest Mistral versions. There we see the medium-sized model doing worse than the large one on questions it hasn't seen before, so the takeaway is that larger models are better at generalizing what they've learned.
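For what it's worth, that comparison boils down to looking at how much each model size drops on unseen questions. The numbers below are placeholders, not results from the study:

```python
# Illustrative only: (pass rate on public/seen questions, pass rate on fresh/unseen ones).
scores = {
    "mistral-medium": (0.62, 0.41),
    "mistral-large": (0.68, 0.60),
}

for model, (seen, unseen) in scores.items():
    drop = seen - unseen
    print(f"{model}: seen={seen:.0%} unseen={unseen:.0%} drop={drop:.0%}")

# A smaller drop for the large model supports the "bigger models generalize
# better" reading; a similar drop for both would not.
```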