Pretty fresh research covering a lot of ‘latest’ models.
https://arxiv.org/pdf/2505.06120
The loss is pretty stark across models. I do feel the method is a little artificial - but the 'problems/conversations' aren't exactly PhD level either, so I'm a little surprised.
It comes with everything on GitHub so you can do your own testing as well.
Conclusion: In this work, we conduct a large-scale simulation of single- and multi-turn conversations with LLMs, and find that on a fixed set of tasks, LLM performance degrades significantly in multi-turn, underspecified settings. LLMs get lost in conversation, which materializes as a significant decrease in reliability as models struggle to maintain context across turns, make premature assumptions, and over-rely on their previous responses. Additional experiments reveal that known remediations that work for simpler settings (such as agent-like concatenation or decreasing temperature during generation) are ineffective in multi-turn settings, and we call on LLM builders to prioritize the reliability of models in multi-turn settings.
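If you want a feel for the setup before digging into their GitHub repo, here's a rough sketch of the kind of comparison the conclusion describes: the same task given fully specified in one turn versus revealed in underspecified "shards" across turns, with low temperature (one of the remediations they say doesn't help). This is my own toy illustration, not the paper's actual harness - the model name, the example task, and the shard wording are all placeholders.

```python
# Toy sketch of single-turn vs. sharded multi-turn prompting.
# Assumes the OpenAI Python client; model name and example task are
# placeholders, not taken from the paper's benchmark.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any chat model

# One task, expressed two ways: fully specified up front, or drip-fed.
FULL_SPEC = (
    "Write a Python function that parses a CSV file, drops rows with "
    "missing values, and returns the mean of the 'price' column."
)
SHARDS = [
    "Write a Python function that parses a CSV file.",
    "Actually, it should also drop rows with missing values.",
    "And I need it to return the mean of the 'price' column.",
]

def single_turn() -> str:
    """Full specification in a single user turn."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": FULL_SPEC}],
        temperature=0,  # low temperature: one of the remediations tested
    )
    return resp.choices[0].message.content

def multi_turn() -> str:
    """Same task revealed one underspecified shard per turn."""
    messages = []
    answer = ""
    for shard in SHARDS:
        messages.append({"role": "user", "content": shard})
        resp = client.chat.completions.create(
            model=MODEL, messages=messages, temperature=0
        )
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
    return answer  # only the final answer gets scored against the full spec

if __name__ == "__main__":
    print("--- single turn ---\n", single_turn())
    print("--- multi turn ---\n", multi_turn())
```

Running something like this a bunch of times per task is roughly how you'd see the reliability gap they report; their repo has the real tasks and scoring.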