Reasoning Degradation in LLMs with Long Context Windows: New Benchmarks

That’s intriguing. I wonder why Sonnet 3.5 and Gemini perform better with negative d values?