Totally agree. An interesting approach to this would be to create a new set of questions for an existing benchmark and compare performance between the two: the model should perform worse on the new questions if it's been polluted by the original benchmark being available on the internet.
And luckily for us someone has already done that.
https://arxiv.org/html/2405.00332v1
And here are the results (more details in the paper).
The interesting thing here is that we know some of the models in that study share the same training dataset, as with the latest version of Mistral: the medium-sized model does worse than the large one on questions it hasn't seen before. So the takeaway is that larger models are better at generalizing what they've learned.
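If it helps, here's a minimal sketch of that contamination check in Python, assuming you already have some way to query the model (the `ask_model` callable below is a hypothetical stand-in, not a real API):

```python
def accuracy(questions, answers, ask_model):
    """Fraction of questions the model answers correctly."""
    correct = sum(ask_model(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def contamination_gap(original, heldout, ask_model):
    """Accuracy drop from the public benchmark to freshly written
    questions; a large positive gap suggests the original set leaked
    into the training data."""
    orig_acc = accuracy(*original, ask_model)
    new_acc = accuracy(*heldout, ask_model)
    return orig_acc - new_acc

# Hypothetical usage with toy data:
orig = (["2+2?", "3*3?"], ["4", "9"])
new = (["5+6?", "7*8?"], ["11", "56"])
ask_model = lambda q: "4"  # stand-in for a real model call
print(contamination_gap(orig, new, ask_model))  # 0.5 with this toy model
```

Obviously the hard part in practice is writing held-out questions that match the original benchmark's difficulty distribution, which is exactly what the paper above goes to some lengths to control for.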