SWE-bench: Very exciting eval, looking for SOTA

Concept is “Can Language Models Resolve Real-World GitHub Issues?”

AutoCodeRover is the best SOTA I am aware of, at 16% … anyone know better?

Another thread: Fully Autonomous AI Software Engineer Devin - #48 by elmstedt

Atm, for my purposes, this is the most intriguing eval. It's pretty hard to overfit this one, given how the agents separate out the problems.

It'll be interesting to see what big tech comes out with. I'm sure MSFT is hitting it hard. Google too, I hope.


One thing I like about SWE-bench is that it's a pretty good way to compare Claude versus GPT-4.

This echoes my experience as well. I don't know why people are trash-talking GPT-4 in this forum.


I'm also a big fan of the SWE benchmark, although I think it's often used to infer coding performance in an overly generalized way.

Ideally I want to see multiple good benchmark scores across a variety of coding tasks; SWE-bench only measures the model's ability to solve GitHub issues.
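
For anyone who hasn't looked at the tasks themselves, here's a minimal sketch of what an instance contains, using the Hugging Face `datasets` library (the `princeton-nlp/SWE-bench` dataset ID and the field names are my assumption and may vary by version; check the dataset card): each task is just a real GitHub issue paired with the gold patch that resolved it, which is why it only measures issue resolution.

```python
from datasets import load_dataset

# Load the SWE-bench test split from the Hugging Face Hub.
# Dataset ID and field names assumed; check the dataset card if they differ.
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

example = swebench[0]
print(example["repo"])               # the real repository the issue came from
print(example["problem_statement"])  # the GitHub issue text the model must resolve
print(example["patch"])              # the gold patch that actually fixed it
```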

Empty barrels make the most noise, and happy customers usually don't complain :sweat_smile:


All the current code evals are too easy to overfit, imho, though I’d love to hear about any that are not.

I imagine the current set on SWE-bench will suffer from overfitting, but fortunately it's pretty easy to select a new set. So training on that eval would be a rather bad idea and could cause reputational damage.


Totally agree. An interesting approach would be to create a new set of questions for an existing benchmark and compare performance between the two; the model should perform worse on the new set if it has been polluted by the original benchmark being available on the internet :thinking:

And luckily for us, someone has already done that.

https://arxiv.org/html/2405.00332v1

And here are the results (more details in the paper):

[figure from the paper: per-model accuracy on the original benchmark vs. the newly written question set]

The interesting thing here is that we know some of the models in that study share the same training dataset, as with the latest versions of Mistral. Here we see the medium-sized model doing worse than the large one on questions it hasn't seen before, so the takeaway is that larger models are better at generalizing what they've learned :laughing:
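
If you want to eyeball that kind of comparison yourself, here's a minimal sketch (the model names, scores, and the 5-point threshold below are all made up for illustration, not taken from the paper): compute each model's drop from the original set to the fresh set; a big positive gap hints at contamination, while a near-zero gap suggests genuine generalization.

```python
# Hypothetical per-model accuracies on the original benchmark vs. a freshly
# written, held-out question set (all numbers made up for illustration).
scores = {
    "model-medium": {"original": 0.62, "fresh": 0.48},
    "model-large":  {"original": 0.71, "fresh": 0.68},
}

for name, s in scores.items():
    gap = s["original"] - s["fresh"]
    # A large drop on unseen questions hints at contamination/overfitting;
    # a small drop suggests the model actually generalizes.
    verdict = "likely contaminated" if gap > 0.05 else "generalizes well"
    print(f"{name}: gap = {gap:+.2f} ({verdict})")
```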
