Indeed, it is difficult to create a test that involves large context windows and resembles real-world problems. However, the “Highlight Inefficient Code” test that I describe in the paper is a practical and common task in programming.
By the way, I tried to submit the paper to arXiv to encourage more work on these benchmarks, but since I am not affiliated with any institution, I need to be “endorsed” by someone on the platform. This is the standard form I received:
Natanael Fraga requests your endorsement to submit an article to the cs.AI section of arXiv. To tell us that you would (or would not) like to endorse this person, please visit the following URL:
https://arxiv.org/auth/endorse?x=7WIRSX
We don’t expect you to read the paper in detail, or verify that the work is correct, but you should check that the paper is appropriate for the subject area.
If anyone from this community would be willing to endorse me, I would be grateful.