I’m building an AI chat system where users can have long, continuous conversations with different LLMs.
The main challenge is maintaining high conversation quality while also keeping token usage and cost under control over time.
Since conversations can grow very large, sending the entire history on every request is not practical. At the same time, aggressive summarization can hurt the quality of the interaction.
This becomes even more challenging because different models have:
different context window sizes
different tokenization behavior
different input/output pricing
So a strategy that works well for one model may not be optimal for another.
I’m trying to understand:
What are the best proven patterns for managing short-term conversation context in production AI chat systems in a way that balances:
conversation quality
cost efficiency
scalability across many different LLM providers
Specifically:
How should raw messages vs summaries be balanced?
How should systems decide how much recent history to include?
Are there established architectural patterns for this problem?
I’m also very curious how systems like ChatGPT and Claude approach this internally when conversations become long.
I have many ideas but I’m looking for the best approach.
Has this problem been solved in a reusable or well-documented way by any team or open source project?
I personally don’t use the Agents SDK, but to generalize what’s in the cookbook: you essentially want some kind of “compression” process running on the side. The issue with summarization is that you lose a lot of information, so you really want a well-designed prompt (and structure) for extracting the key claims in a conversation. I would always opt for a model like gpt-5.2 to perform something like that. You can store the extracted claims in a Markdown file (or JSON, or a database). I find that this is really all you need.
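To make the claim-extraction idea concrete, here is a minimal sketch. The prompt wording, function names, and `claims.md` path are my own assumptions for illustration, not anything prescribed by the post or the cookbook; the actual model call is left to whatever chat-completion client you use.

```python
# Sketch of the "claim extraction" compression step described above.
# The prompt wording and helper names are assumptions, not a spec.

EXTRACTION_PROMPT = """\
Extract the key factual claims, decisions, and open questions from the
conversation below. Return one claim per line, prefixed with "- ".
Preserve names, numbers, and constraints exactly; omit pleasantries.

Conversation:
{conversation}
"""

def build_extraction_prompt(turns: list[dict]) -> str:
    """Flatten user/assistant turns into a single extraction prompt."""
    conversation = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    return EXTRACTION_PROMPT.format(conversation=conversation)

def store_claims(claims: str, path: str = "claims.md") -> None:
    """Append extracted claims to a running Markdown file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(claims.rstrip() + "\n")
```

You would send `build_extraction_prompt(turns)` to a strong model and persist its response with `store_claims`, then feed the stored claims (instead of the raw turns) into later requests.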
The decision of when to initiate this claim extraction or compression can be anything, really, but the easiest trigger is some character length (e.g. 10k characters) across the messages (user–assistant turns). You can add basic filtering that skips messages under some minimum length (e.g. at least 5 characters), but be careful with this, as it may exclude a number of super-short answers (like “yes”).
What I’m trying to figure out is something like what Cursor did. I’m curious how Cursor manages a context as large as an entire codebase! I want to handle something like this in my app: context management with good quality and, of course, ideal cost. I hope the cookbook helps me.
I would also look at Codex – it has gotten particularly good of late.
Here’s what I did:
I git cloned the repo
I launched codex inside the cloned repo … of codex (oooooh, meta)
Asked it: “summarise how context is managed by this app for large code bases.”
And it gave me this response:
- It stores conversation history as ResponseItems and builds each model prompt from normalized history + fixed session context (developer instructions, AGENTS/user instructions, environment context). See codex-rs/core/src/codex.rs:2493, codex-rs/core/src/codex.rs:2446, codex-rs/core/src/context_manager/history.rs:86.
- It aggressively bounds noisy content: tool outputs are truncated (token/byte policy), long text is middle-truncated, and function output content items are budgeted. See codex-rs/core/src/context_manager/history.rs:327, codex-rs/core/src/truncate.rs:88, codex-rs/core/src/truncate.rs:100.
- It tracks token usage and context window continuously (server usage + local estimates), then emits TokenCount updates to UI. See codex-rs/core/src/codex.rs:2603, codex-rs/protocol/src/protocol.rs:1443.
- Auto-compaction triggers when usage crosses model threshold; threshold defaults to ~90% of model window (clamped) unless overridden. See codex-rs/core/src/codex.rs:4435, codex-rs/core/src/codex.rs:4491, codex-rs/protocol/src/openai_models.rs:259, codex-rs/core/src/config/mod.rs:158.
- Compaction rewrites history into: initial context + recent user messages (capped) + a handoff summary, then resumes. Inline path: codex-rs/core/src/compact.rs:67, codex-rs/core/src/compact.rs:295. Remote path: codex-rs/core/src/compact_remote.rs:60.
- UI context meter reserves baseline overhead (BASELINE_TOKENS=12000) so “% remaining” reflects user-controllable space, not fixed prompt/tool overhead. See codex-rs/protocol/src/protocol.rs:1478, codex-rs/protocol/src/protocol.rs:1512.
- Repo exploration tools are intentionally scoped/limited (instead of full-repo ingest): read_file uses offset/limit and line-length caps, grep_files has result/time limits, fuzzy file search caps results. See codex-rs/core/src/tools/handlers/read_file.rs:31, codex-rs/core/src/tools/handlers/read_file.rs:19, codex-rs/core/src/tools/handlers/grep_files.rs:20, codex-rs/app-server/src/fuzzy_file_search.rs:16.
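The middle-truncation idea in the second bullet (keep the head and tail of an over-long tool output, elide the middle) is easy to sketch. This is my own simplification for illustration, not the actual code in codex-rs/core/src/truncate.rs:

```python
# Keep the first and last portions of an over-long string and mark the
# elided middle. A simplification of the middle-truncation idea above,
# not the codex-rs implementation.

def middle_truncate(text: str, max_chars: int,
                    marker: str = "\n[...truncated...]\n") -> str:
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)  # budget left for real content
    head = keep // 2
    tail = keep - head
    return text[:head] + marker + text[len(text) - tail:]
```

Truncating the middle rather than the end matters for tool output, where both the opening lines (the command, the first results) and the closing lines (exit status, final errors) tend to carry the signal.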
Inference from code search: I don’t see a built-in embedding/vector “index entire codebase” pipeline in core; context is assembled from history + targeted tool reads.
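The auto-compaction trigger in the fourth bullet reduces to a simple threshold check. The ~0.9 default mirrors that bullet; the clamping bounds and names here are my own assumptions for illustration:

```python
# Fire compaction when estimated usage crosses a fraction of the model's
# context window. The ~0.9 default mirrors the bullet above; the clamp
# bounds and function name are assumptions, not codex-rs values.

def should_compact(used_tokens: int, context_window: int,
                   threshold: float = 0.9) -> bool:
    # Clamp so misconfiguration can't disable compaction entirely
    # or trigger it on every request.
    threshold = min(max(threshold, 0.5), 0.98)
    return used_tokens >= int(context_window * threshold)
```

A per-model `context_window` (plus per-model tokenization and pricing) is exactly where the original question’s “different models need different strategies” concern shows up: the same check generalizes across providers once the window size is parameterized.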
Given its performance, I’d wager this is state of the art …