Education level interpretation of GPT-4o's benchmarks

Some more benchmark results here: List of fresh gpt-4o benchmarks, please add - #2 by qrdl

I'm seeing a lot of results showing failures on longer contexts.