I’m building an AI chat system where users can have long, continuous conversations with different LLMs.
The main challenge is maintaining high conversation quality while also keeping token usage and cost under control over time.
Since conversations can grow very large, sending the entire history on every request is not practical. At the same time, aggressive summarization can hurt the quality of the interaction.
This becomes even more challenging because different models have:
different context window sizes
different tokenization behavior
different input/output pricing
So a strategy that works well for one model may not be optimal for another.
I’m trying to understand:
What are the best proven patterns for managing short-term conversation context in production AI chat systems in a way that balances:
conversation quality
cost efficiency
scalability across many different LLM providers
Specifically:
How should raw messages vs summaries be balanced?
How should systems decide how much recent history to include?
Are there established architectural patterns for this problem?
I’m also very curious how systems like ChatGPT and Claude approach this internally when conversations become long.
I have many ideas but I’m looking for the best approach.
Has this problem been solved in a reusable or well-documented way by any team or open source project?
I personally don’t use the Agents SDK, but to generalize what’s in the cookbook: you essentially want some kind of “compression” process running on the side. The issue with summarization is that you lose a lot of information, so you really want a well-designed prompt (and structure) for extracting the key claims in a conversation. I would always opt for a model like gpt-5.2 to perform something like that. You can store the extracted claims in a Markdown file (or JSON, or a database). I find that this is really all you need.
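To make the claim-extraction idea concrete, here is a minimal sketch. The prompt wording, function names, and `claims.md` path are my own assumptions for illustration, not anything prescribed by the post or the cookbook; the actual model call is left to whatever chat-completion client you use.

```python
# Sketch of the "claim extraction" compression step described above.
# The prompt wording and helper names are assumptions, not a spec.

EXTRACTION_PROMPT = """\
Extract the key factual claims, decisions, and open questions from the
conversation below. Return one claim per line, prefixed with "- ".
Preserve names, numbers, and constraints exactly; omit pleasantries.

Conversation:
{conversation}
"""

def build_extraction_prompt(turns: list[dict]) -> str:
    """Flatten user/assistant turns into a single extraction prompt."""
    conversation = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    return EXTRACTION_PROMPT.format(conversation=conversation)

def store_claims(claims: str, path: str = "claims.md") -> None:
    """Append extracted claims to a running Markdown file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(claims.rstrip() + "\n")
```

You would send `build_extraction_prompt(turns)` to a strong model and persist its response with `store_claims`, then feed the stored claims (instead of the raw turns) into later requests.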
The decision of when to initiate this claim extraction or compression can be anything, really, but the easiest trigger is some character length (e.g. 10k characters) across the messages (user–assistant turns). You can add basic filtering that skips messages under some minimum length (e.g. at least 5 characters), but be careful with this, as it may exclude a number of super-short answers (like “yes”).
What I’m trying to figure out is something like what Cursor did. I’m curious how Cursor manages a context as large as an entire codebase! I want to handle something like this in my app: context management with good quality and, of course, ideal cost. I hope the cookbook helps me.
I would also look at Codex – it has gotten particularly good of late.
Here’s what I did:
I git cloned the repo
I launched codex inside the cloned repo … of codex (oooooh, meta)
Asked it: “summarise how context is managed by this app for large code bases.”
And it gave me this response:
- It stores conversation history as ResponseItems and builds each model prompt from normalized history + fixed session context (developer instructions, AGENTS/user instructions, environment context). See codex-rs/core/src/codex.rs:2493, codex-rs/core/src/codex.rs:2446, codex-rs/core/src/context_manager/history.rs:86.
- It aggressively bounds noisy content: tool outputs are truncated (token/byte policy), long text is middle-truncated, and function output content items are budgeted. See codex-rs/core/src/context_manager/history.rs:327, codex-rs/core/src/truncate.rs:88, codex-rs/core/src/truncate.rs:100.
- It tracks token usage and context window continuously (server usage + local estimates), then emits TokenCount updates to UI. See codex-rs/core/src/codex.rs:2603, codex-rs/protocol/src/protocol.rs:1443.
- Auto-compaction triggers when usage crosses model threshold; threshold defaults to ~90% of model window (clamped) unless overridden. See codex-rs/core/src/codex.rs:4435, codex-rs/core/src/codex.rs:4491, codex-rs/protocol/src/openai_models.rs:259, codex-rs/core/src/config/mod.rs:158.
- Compaction rewrites history into: initial context + recent user messages (capped) + a handoff summary, then resumes. Inline path: codex-rs/core/src/compact.rs:67, codex-rs/core/src/compact.rs:295. Remote path: codex-rs/core/src/compact_remote.rs:60.
- UI context meter reserves baseline overhead (BASELINE_TOKENS=12000) so “% remaining” reflects user-controllable space, not fixed prompt/tool overhead. See codex-rs/protocol/src/protocol.rs:1478, codex-rs/protocol/src/protocol.rs:1512.
- Repo exploration tools are intentionally scoped/limited (instead of full-repo ingest): read_file uses offset/limit and line-length caps, grep_files has result/time limits, fuzzy file search caps results. See codex-rs/core/src/tools/handlers/read_file.rs:31, codex-rs/core/src/tools/handlers/read_file.rs:19, codex-rs/core/src/tools/handlers/grep_files.rs:20, codex-rs/app-server/src/fuzzy_file_search.rs:16.
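The middle-truncation idea in the second bullet (keep the head and tail of an over-long tool output, elide the middle) is easy to sketch. This is my own simplification for illustration, not the actual code in codex-rs/core/src/truncate.rs:

```python
# Keep the first and last portions of an over-long string and mark the
# elided middle. A simplification of the middle-truncation idea above,
# not the codex-rs implementation.

def middle_truncate(text: str, max_chars: int,
                    marker: str = "\n[...truncated...]\n") -> str:
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)  # budget left for real content
    head = keep // 2
    tail = keep - head
    return text[:head] + marker + text[len(text) - tail:]
```

Truncating the middle rather than the end matters for tool output, where both the opening lines (the command, the first results) and the closing lines (exit status, final errors) tend to carry the signal.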
Inference from code search: I don’t see a built-in embedding/vector “index entire codebase” pipeline in core; context is assembled from history + targeted tool reads.
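The auto-compaction trigger in the fourth bullet reduces to a simple threshold check. The ~0.9 default mirrors that bullet; the clamping bounds and names here are my own assumptions for illustration:

```python
# Fire compaction when estimated usage crosses a fraction of the model's
# context window. The ~0.9 default mirrors the bullet above; the clamp
# bounds and function name are assumptions, not codex-rs values.

def should_compact(used_tokens: int, context_window: int,
                   threshold: float = 0.9) -> bool:
    # Clamp so misconfiguration can't disable compaction entirely
    # or trigger it on every request.
    threshold = min(max(threshold, 0.5), 0.98)
    return used_tokens >= int(context_window * threshold)
```

A per-model `context_window` (plus per-model tokenization and pricing) is exactly where the original question’s “different models need different strategies” concern shows up: the same check generalizes across providers once the window size is parameterized.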
Given its performance, I’d wager this is state of the art …