I'd like to report a terrible academic crime

I can’t believe the authors did this. Yes, a better set was a good idea, but OpenAI should not have been involved at all.

Unreal and so incredibly disappointing. SWE-bench was one of the best tests available, and now its integrity and validity have been totally compromised.


All the work is out there in the open, i.e., all the human annotations, and every model vendor has access to all of it. Having worked on bug reports for years, I can attest that reaching a clear understanding of ‘bug or feature’, and of ‘what does fixed look like’, is often more than 80% of the trouble.
I am sure that in the future we will add SWE++ benches that include ‘go figure it out’, which would mean asking the right clarifying questions on issues. But of course some would argue ‘that way anyone can solve it’. So yeah … fascinating stuff, these benchmarks :slight_smile:


I’m not disagreeing with you. Can you elaborate on the issue here?

I take it you are concerned that it’s no longer fully independent?

Are you concerned about the validity of the benchmark for performance evaluations? From my reading of the blog post, this is more of an AI safety issue and should be treated as such. If SWE-bench or marketing materials were to report these results as indicators of LLM performance, that would indeed be problematic.

From the blog post:

We use SWE-bench as one of several evaluations tracking the medium risk level of the Model Autonomy risk category in our Preparedness Framework. Tracking catastrophic risk levels through evaluations depends on ensuring that we can trust the evaluation results and that we are properly calibrated in understanding what the scores represent.