I'd like to report a terrible academic crime

I can’t believe the authors did this. Yes, a better set was a good idea, but OpenAI should not have been involved at all.

Unreal and so incredibly disappointing. SWE-bench was one of the best tests available, and now its integrity and validity have been totally compromised.


All the work is out there in the open, i.e., all the human annotations, and every model vendor has access to all of it. Having worked on bug reports for years, I can attest that reaching a clear understanding of ‘bug or feature’, and of ‘what does fixed look like’, is often more than 80% of the trouble.
I am sure that in the future we will add SWE++ benches that include ‘go figure it out’, which would mean asking the right clarifying questions on issues. But of course some would argue ‘that way anyone can solve it’. So yeah … fascinating stuff, these benchmarks :slight_smile:


I’m not disagreeing with you. Can you elaborate on the issue here?

I take it you are concerned that it’s no longer fully independent?

Are you concerned about the validity of the benchmark for performance evaluations? From my reading of the blog post, this is more of an AI safety issue and should be treated as such. If SWE-bench or marketing materials were to report these results as indicators of LLM performance, that would indeed be problematic.

From the blog post:

We use SWE-bench as one of several evaluations tracking the medium risk level of the Model Autonomy risk category in our Preparedness Framework. Tracking catastrophic risk levels through evaluations depends on ensuring that we can trust the evaluation results and that we are properly calibrated in understanding what the scores represent.