Inference-time systems proposal: KV-cache relay to eliminate redundant prefill across sub-agents

KV-Cache Relay: Eliminating Redundant Prefill in Multi-Agent Codex Pipelines

The Problem

When Codex delegates subtasks to multiple agents (refactor → test → fix → review), each agent re-reads the entire codebase context from scratch. With 5 agents on a 70K-token project, that’s 350K tokens of prefill per pipeline run — 280K of them redundant.

This costs time (10–15s of redundant prefill on H100), money ((N − 1) × C redundant tokens billed for N agents sharing a C-token context), and can degrade quality through “lost in the middle” effects on long contexts.

Proposed Mechanism

Architect agent reads full context once → KV-cache snapshot saved → worker agents load snapshot + short text instruction (~50–100 tokens) → workers generate code immediately, as if they read the full context themselves.

No retraining. No weight changes. Inference pipeline modification only. Backward-compatible API addition: two new optional parameters (save_kv_snapshot, load_kv_snapshot).
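To make the proposed API shape concrete, here is a toy in-process sketch of the relay flow. The parameter names `save_kv_snapshot` / `load_kv_snapshot` are proposal names from this post, not a shipped API, and the dict-backed store stands in for an opaque server-side KV snapshot:

```python
# Toy sketch of the proposed relay flow. save_kv_snapshot / load_kv_snapshot
# are the proposed (hypothetical) parameters; SNAPSHOTS stands in for
# provider-side snapshot storage keyed by an opaque ID.

SNAPSHOTS = {}  # snapshot_id -> cached session state (opaque to the caller)

def complete(prompt, save_kv_snapshot=None, load_kv_snapshot=None):
    """Toy stand-in for a completion call with the two proposed parameters."""
    if load_kv_snapshot is not None:
        # Worker path: reuse the architect's cached context instead of
        # re-sending and re-prefilling the full project context.
        context = SNAPSHOTS[load_kv_snapshot]
        effective_prompt = context + "\n" + prompt  # ~50-100 token suffix
    else:
        effective_prompt = prompt
    if save_kv_snapshot is not None:
        # Architect path: persist the processed context for later workers.
        SNAPSHOTS[save_kv_snapshot] = effective_prompt
    return f"<completion over {len(effective_prompt)} chars>"

# Architect reads the full context once and saves a snapshot.
complete("<70K-token project context + plan>", save_kv_snapshot="run-42")

# Each worker sends only a short instruction plus the snapshot ID.
out = complete("Implement the PUT endpoint per the plan.",
               load_kv_snapshot="run-42")
```

In a real deployment the snapshot would be KV tensors held inside the serving infrastructure; only the opaque ID crosses the API boundary.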

Experimental Validation

I ran 5 controlled experiments across 3 models (Qwen 1.5B, DeepSeek-R1 1.5B, Qwen-Coder 7B) on code generation and factual-recall tasks (FastAPI + SQLite + HTML/JS).

Key results:

| Metric | KV Relay | Text Relay | Direct |
| --- | --- | --- | --- |
| Instruction compliance (7B) | 100% | 100% | 100% |
| Follows plan details (partial PUT, row_factory, response format) | :white_check_mark: | :cross_mark: | :cross_mark: |
| Fact recall on technical docs (1.5B) | 100% | 25% | 100% |
| Speed on factual retrieval | 1.79s | 12.55s | 3.35s |

The 25% recall on text relay is not a typo — the model hallucinated irrelevant content when summarizing a technical protocol, losing 3 out of 4 critical parameters. KV relay preserved all of them.

How This Differs from Prefix Caching

Prefix caching (already in the API) requires identical token sequences. KV-relay transfers the architect’s FULL session state — including generated thinking tokens and plan reasoning — to workers with DIFFERENT instruction suffixes. The cached state contains the architect’s analysis, not just the static prompt.
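The distinction can be sketched as a toy model (token lists standing in for KV tensors; not a real serving implementation). The key point: the architect’s generated plan tokens never appear verbatim in any worker’s prompt, so an exact-prefix cache can never cover them — KV relay hands them over explicitly:

```python
# Toy contrast: prefix caching vs. KV relay.
# Prefix caching hits only when a request's leading token sequence is
# identical to the cached one. The architect's *generated* plan/thinking
# tokens are never part of a worker's prompt, so they fall outside any
# exact-prefix cache.

prompt_tokens = ["<ctx>"] * 5     # static project context (shared prompt)
generated_plan = ["<plan>"] * 3   # architect's generated reasoning

def prefix_cache_hit(request_tokens, cached_tokens):
    # Cache hit requires an identical leading token sequence.
    return request_tokens[:len(cached_tokens)] == cached_tokens

# Worker prompt = static context + its own short instruction.
worker_request = prompt_tokens + ["<instr>"]
assert prefix_cache_hit(worker_request, prompt_tokens)  # static prompt reused
# ...but the generated plan is outside what prefix caching can cover:
assert not prefix_cache_hit(worker_request, prompt_tokens + generated_plan)

# KV relay instead hands the worker the whole session state up front,
# generated plan included, followed by a short instruction suffix:
session_state = prompt_tokens + generated_plan  # stands in for KV tensors
worker_input = session_state + ["<instr>"]      # no re-prefill of the context
```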

Scaling

  • 5 agents, 70K context: 280K tokens saved per run (80% reduction)
  • 10 agents, 70K context: 630K tokens saved per run (90% reduction)
  • At API pricing, this is $500+/month for heavy enterprise workflows
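The savings figures above follow directly from the relay model: N agents each prefilling the same C-token context is N×C tokens baseline, versus a single C-token prefill with relay, saving (N − 1)×C:

```python
# Back-of-envelope for the savings claimed above: N agents sharing a
# C-token context prefill N*C tokens baseline; relaying one snapshot
# cuts that to a single C-token prefill, saving (N-1)*C.
def tokens_saved(n_agents, context_tokens):
    baseline = n_agents * context_tokens
    saved = (n_agents - 1) * context_tokens
    return saved, saved / baseline

print(tokens_saved(5, 70_000))   # → (280000, 0.8)
print(tokens_saved(10, 70_000))  # → (630000, 0.9)
```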

Secondary Finding: Term Repetition

Repeating key terms 3–5× in the source context significantly improves instruction fidelity through KV relay. When “sqlite3” appeared once in the plan, the worker used SQLAlchemy. When repeated 12 times with 5 explicit “NOT SQLAlchemy” mentions, the worker followed the spec perfectly. This has implications for how AGENTS.md and system prompts should be structured.

Full Proposal

Detailed document with all experimental data, comparison against strongest baselines (prefix caching, RAG, shared context, Doc-to-LoRA), equivalence boundaries, security/isolation analysis, falsification conditions, and recommended first test:

:page_facing_up: docs[dot]google[dot]com/document/d/1-kDkN6_pYIt_ASBEwChIsspiwlV_mDRA/edit

All experimental code (~2000 LOC Python) and raw JSON results available on request.


Security & Isolation

A note on security and isolation that strengthens the case for KV relay in enterprise settings.

KV relay reduces exposure surface compared to text relay. This is the core security argument.

Text-based context relay creates a human-readable, model-agnostic artifact. That artifact can be logged, copied, indexed, forwarded, piped into RAG, or read by any other model. It’s the most portable and most leakable form of context transfer.

A KV snapshot is none of those things. It’s a model-specific computational state tied to a compatible architecture, weights, tokenizer, positional scheme, and serving implementation. Outside that exact compatibility envelope, it degrades from useful computational state into opaque tensors with no reliable task-level interpretation.

In practical terms:

  • Cross-model transfer is not meaningfully usable in practice
  • Cross-version transfer is brittle and often invalidated by retraining, model updates, or serving changes
  • In closed-model deployments, snapshots can remain entirely within the provider’s serving infrastructure rather than being externalized as user-readable text

This doesn’t make KV relay “cryptographically secure” — but it does make it operationally far less portable and less exfiltration-friendly than text relay. For multi-tenant agent pipelines handling sensitive codebases, that is a meaningful improvement over the status quo.

Would love to hear thoughts from the team and community — especially anyone working on multi-agent orchestration in Codex. Has anyone else run into the redundant prefill problem at scale?