Still, it would be nice if they had run it on the old benchmark for the 1:1 comparison… Or if they had run the old model on the new benchmark.
Never mind, I guess they ran o3 and o3-pro on the new benchmark.
Still, it would be great if they re-ran all their models on the new benchmark to really show the improvement.
Also, why is there no standard versioning for the Codeforces benchmark?