Totally agree. An interesting approach to this would be to create a new set of questions for an existing benchmark and compare performance between the two: the model should perform worse on the new questions if it's been polluted by the original benchmark being available on the internet.
And luckily for us someone has already done that.
https://arxiv.org/html/2405.00332v1
And here are the results (more details in the paper).
The interesting thing here is that we know some of the models in that study share the same training dataset, as with the latest version of Mistral: the medium-sized model does worse than the large one on questions it hasn't seen before. So the takeaway is that larger models are better at generalizing what they've learned.
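If it helps, here's a minimal sketch of that contamination check in Python, assuming you already have some way to query the model (the `ask_model` callable below is a hypothetical stand-in, not a real API):

```python
def accuracy(questions, answers, ask_model):
    """Fraction of questions the model answers correctly."""
    correct = sum(ask_model(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def contamination_gap(original, heldout, ask_model):
    """Accuracy drop from the public benchmark to freshly written
    questions; a large positive gap suggests the original set leaked
    into the training data."""
    orig_acc = accuracy(*original, ask_model)
    new_acc = accuracy(*heldout, ask_model)
    return orig_acc - new_acc

# Hypothetical usage with toy data:
orig = (["2+2?", "3*3?"], ["4", "9"])
new = (["5+6?", "7*8?"], ["11", "56"])
ask_model = lambda q: "4"  # stand-in for a real model call
print(contamination_gap(orig, new, ask_model))  # 0.5 with this toy model
```

Obviously the hard part in practice is writing held-out questions that match the original benchmark's difficulty distribution, which is exactly what the paper above goes to some lengths to control for.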