Interesting research: lost in conversation. All models get lost easily in multi-turn conversations

Pretty fresh research covering a lot of ‘latest’ models.
https://arxiv.org/pdf/2505.06120

The loss is pretty stark across models. I do feel the method is a little artificial, but the ‘problems/conversations’ aren’t exactly PhD level either, so I’m a little surprised.
It comes with everything on GitHub so you can do your own testing as well.

Conclusion: In this work, we conduct a large-scale simulation of single- and multi-turn conversations with LLMs, and find that on a fixed set of tasks, LLM performance degrades significantly in multi-turn, underspecified settings. LLMs get lost in conversation, which materializes as a significant decrease in reliability as models struggle to maintain context across turns, make premature assumptions, and over-rely on their previous responses. Additional experiments reveal that known remediations that work for simpler settings (such as agent-like concatenation or decreasing temperature during generation) are ineffective in multi-turn settings, and we call on LLM builders to prioritize the reliability of models in multi-turn settings.
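
For anyone poking at the repo, the ‘agent-like concatenation’ remediation mentioned in the conclusion is, as I read it, just folding the gradually revealed instruction shards back into a single fully specified turn. Here’s a rough sketch of the two settings (the shard text and the `call_llm` helper are mine, not from the paper’s code):

```python
# Hedged sketch of how I read the paper's "sharded" multi-turn setup vs. the
# "agent-like concatenation" remediation. Shards and call_llm are illustrative.

def call_llm(messages):
    """Placeholder for your actual client call (e.g. a chat completions API)."""
    raise NotImplementedError

# A fully specified task, split into shards a simulated user reveals one turn at a time.
shards = [
    "Write a Python function that parses a date string.",
    "It should accept both 'YYYY-MM-DD' and 'DD/MM/YYYY' formats.",
    "Return a datetime.date and raise ValueError on anything else.",
]

def multi_turn(shards):
    """Underspecified multi-turn setting: one shard per user turn."""
    messages = []
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        messages.append({"role": "assistant", "content": call_llm(messages)})
    return messages[-1]["content"]

def concatenated(shards):
    """Remediation: all shards folded into a single, fully specified turn."""
    return call_llm([{"role": "user", "content": "\n".join(shards)}])
```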

10 Likes

Yes. I’ve created a system to address this thoroughly by treating a “conversation” instead as a “world state of data” (including documents through RAG, agents, orchestrators, multi-threaded parallel LLM-to-LLM calls, etc.).

The “world state” is managed by treating all assistant messages (i.e. LLM responses) as “input data”, along with system logs, retrieved documents, user messages, etc. The user and LLM thus work together to “build a world state” that serves as the “dynamically updated context window”, and that data is contained within a single user-role message instead of a string of back-and-forth assistant/user messages.

This, of course, means that you are entering a “non-conversational mode” and instead maintaining a “set of data, instructions, and current logical state of tasks/goals/requirements, etc.”, which both the user and the LLM (through structured responses parsed by the system) carefully manage together.
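
For a minimal sketch of what I mean (simplified for illustration; the field names and structure here are just an example, not my actual schema):

```python
import json

# Illustrative only: the conversation is collapsed into one curated "world state"
# that is sent as a single user-role message on every call.
world_state = {
    "goals": ["Refactor the ingestion pipeline", "Keep backwards compatibility"],
    "instructions": "Propose the next concrete step and list any open questions.",
    "retrieved_documents": [{"source": "design_doc.md", "excerpt": "..."}],
    "prior_llm_outputs": [{"step": 1, "summary": "Suggested splitting the parser into two modules."}],
    "system_logs": ["Step 1 accepted by user."],
    "task_state": {"current_step": 2, "status": "in_progress"},
}

messages = [
    {"role": "system", "content": "Operate on the provided world state, not on chat history."},
    {"role": "user", "content": json.dumps(world_state, indent=2)},
]
# The assistant's structured reply is parsed by the orchestrator and merged back
# into world_state, so there is never a long user/assistant back-and-forth.
```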

The purpose of creating this system was to manage very large context windows and continue project development/research flows without the need to manually reset context windows, etc., providing an essentially “infinite multi-turn experience”.

Of course this is a bit different from certain use cases, in which we seamlessly shift back into the “normal conversation” (sequential set) when appropriate for discussion, etc. But for all other tasks, the “three-dimensional, non-sequential, world-state context window” seems to be much more cutting-edge and leverages the LLM’s true capacity by interacting with it appropriately, given its “stateless nature and inherent recency bias when exposed to multi-turn conversational data”.

I’ve shared some screenshots and current work on this project in the “Hackathon” thread within this forum, and consider it still a work in progress, though only a few weeks away now from a fully functioning, deployable implementation.

If you know anyone who would like to test it…

3 Likes

I literally maintain coherence until the chat limit comes up. No drift.

So not all models. The regular base 4o model is staying focused and coherent for me.

Thanks for sharing! The paper offers a sharp insight: performance drops steeply when language models are asked to respond incrementally across multiple conversational turns. I wonder if the framing may be misidentifying the root cause, though.

It seems like the paper isn’t measuring a failure of reasoning but rather a failure of re-orientation. The study simulates multi-turn interaction in a vacuum: model-to-model, turn-to-turn, instruction-to-response. What’s missing is exactly what real human-LLM interactions depend on.

In deployed environments like ChatGPT, model behavior isn’t governed by the decoder alone. It is continuously shaped by a layered architecture: classifier outputs, summarization modules, and personalization flags. None of these layers are represented in this study.
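
To make “layered” concrete, here is a purely hypothetical sketch; none of these component names describe any real production stack, the point is only that the decoder is one stage among several:

```python
# Purely illustrative sketch of a layered serving pipeline; the components and
# their behavior are hypothetical, not a description of any real product.

def classify_intent(user_message: str) -> str:
    """Stub classifier: tag the request so downstream layers can adapt."""
    return "code_request" if "function" in user_message else "general"

def summarize_history(history: list[str]) -> str:
    """Stub summarization module: condense prior turns instead of replaying them."""
    return " | ".join(history[-3:])

def personalization_flags(profile: dict) -> dict:
    """Stub personalization layer: flags that shape tone and verbosity."""
    return {"verbose": profile.get("prefers_detail", False)}

def decode(prompt: str) -> str:
    """Stand-in for the bare decoder call, the only layer the paper simulates."""
    return f"<model response to: {prompt[:60]}...>"

def respond(user_message: str, history: list[str], profile: dict) -> str:
    """The decoder's input is shaped by every layer above it."""
    intent = classify_intent(user_message)
    condensed = summarize_history(history)
    flags = personalization_flags(profile)
    prompt = f"[intent={intent}][flags={flags}]\nContext: {condensed}\nUser: {user_message}"
    return decode(prompt)
```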

When the model falters in response to another model, it isn’t just “getting lost.” It receives no social signals, no ambiguity cues, no expectations. It gets stuck in a logic loop that mimics dialogue but lacks contextual resonance.

The paper does state that using real users is cost-prohibitive, but it might be necessary if we want to study and analyze the dynamics of co-adaptive inference, where models don’t just generate answers but track user orientation, recalibrate intent, and hold ambiguity over several turns.

I agree with a lot of what you’re saying - and yet if you look at how simple the questions are, how easy (and short) the cues are, etc., a human would have no problem with this dialog. So it IS a little surprising that models find it so hard.

1 Like