All the current code evals are too easy to overfit, imho, though I’d love to hear about any that are not.
I imagine the current SWE-bench set will suffer from overfitting, but fortunately it's pretty easy to select a new set of tasks. That also means training on the eval would be a rather bad idea: it would get exposed quickly and could cause reputational damage.
Totally agree. An interesting approach to this would be to create a new set of questions for an existing benchmark and compare performance between the two sets; the model should perform worse on the new one if it's been contaminated by the original benchmark being available on the internet.
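Something like this, as a rough sketch (the split names and numbers are made up, just to illustrate the comparison):

```python
# Toy version of the "old questions vs. fresh questions" contamination check.
# Assumes you already have per-task pass/fail results for one model on both
# splits; `original_results` / `fresh_results` are placeholder names.

def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks the model solved."""
    return sum(results) / len(results) if results else 0.0

def contamination_gap(original_results: list[bool],
                      fresh_results: list[bool]) -> float:
    """Drop in pass rate going from the public (possibly memorized) set
    to the freshly written set. A large positive gap is a red flag."""
    return pass_rate(original_results) - pass_rate(fresh_results)

# e.g. 70% on the public set vs. 45% on new questions -> 25-point gap
original = [True] * 70 + [False] * 30
fresh = [True] * 45 + [False] * 55
print(f"gap: {contamination_gap(original, fresh):+.0%}")
```

You'd still want to control for the new questions simply being harder, but a big drop on an otherwise comparable set is a strong hint.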
The interesting thing here is that we know some of the models in that study share the same training dataset, as with the latest Mistral versions. There we see the medium-sized model doing worse than the large one on questions it hasn't seen before, so the takeaway is that larger models are better at generalizing what they've learned.
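For what it's worth, that comparison boils down to looking at how much each model size drops on unseen questions. The numbers below are placeholders, not results from the study:

```python
# Illustrative only: (pass rate on public/seen questions, pass rate on fresh/unseen ones).
scores = {
    "mistral-medium": (0.62, 0.41),
    "mistral-large": (0.68, 0.60),
}

for model, (seen, unseen) in scores.items():
    drop = seen - unseen
    print(f"{model}: seen={seen:.0%} unseen={unseen:.0%} drop={drop:.0%}")

# A smaller drop for the large model supports the "bigger models generalize
# better" reading; a similar drop for both would not.
```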