Persistent 0% prompt cache hits on GPT-5.5 with Auckland NZ Cloudflare 520s complicating every workaround

slackermanz · June 16, 2026, 5:46am

I’m trying to understand whether this is a known GPT-5.5 / Responses API issue, an account/project/routing issue, or something unusual about my request shape.

I run a long-context agent harness from New Zealand, usually routed through Auckland / ANZ Cloudflare paths. The workload is repeated same-channel GPT-5.5 Responses API traffic with very large context windows. Typical successful requests are around 290K–350K input tokens, roughly 1.3–1.5 MB serialized JSON, and usually 300–370 input items.

The prompt-cache behavior I’m seeing is effectively broken. Most successful responses report cached_tokens as 0 or omit cache telemetry entirely. Occasionally I see very small cache hits, but they are capped around a few thousand tokens, generally under about 6,000. I am not seeing anything like the large partial-prefix cache hits expected from repeated long-context traffic. The stable shared prefix can be over 1 MB, but the credited cache hit, when present at all, is tiny.

Current request posture:

Model: GPT-5.5
API: /v1/responses
prompt_cache_key: present and stable per channel/context family
prompt_cache_retention: “24h”
Background mode uses store: true
Requests are far above the 1024-token prompt-caching threshold
Request cadence per channel/cache key is below the documented high-throughput overflow concern
Same-channel calls often have very large stable prefixes, with only tail content changing

I have built local request/response logging and final pre-fetch request capture so I can compare what is actually being sent. For many same-channel adjacent calls, the request family is stable, the same 32K/128K/512K body-prefix fingerprints repeat, and the common serialized prefix is often around 1.2–1.4 MB. In those cases, the only meaningful changes are at the tail of the conversation. Despite that, the usage data still shows 0 cached tokens or only a tiny few-thousand-token cache hit.

The initial trigger for investigating all of this was a separate but related operational problem: Cloudflare 520 failures from the New Zealand/Auckland route. Those failures pushed me to test several different Responses API execution modes and transport shapes. Some modes make the 520 behavior much worse. Background polling is the most resilient delivery mode I’ve found, because foreground modes often fail or require recovery, but background mode still does not restore meaningful prompt caching.

Things I have tried so far:

Standard Responses API background polling
Foreground synchronous Responses mode
Streaming / recovery paths
Lean direct HTTP Responses calls without SDK intermediation
Chat Completions as a comparison path
With and without prompt_cache_key
Explicit prompt_cache_retention: “24h”
Stable channel-scoped cache keys
Cache-key rotation after repeated misses
Removing optional metadata and extra request adornments
Restoring background polling for delivery resilience
Preserving and replaying assistant message phase for Responses API conformance
Checking that request prefixes are actually stable immediately before fetch

None of this has restored meaningful cache utilization.

The current pattern is:

Background mode: best delivery reliability, still effectively no useful cache
Foreground modes: more 520/recovery failures, still no useful cache
Lean/direct paths: no meaningful cache improvement
Chat comparison: did not show meaningful cache recovery either
Some tiny cache hits occasionally appear, but only around a few thousand tokens, not the long-context prefix

I have an open support case, 10026139, but have not received a substantive response yet.

What I’m trying to determine:

Are other people currently seeing large prompt-cache hits on GPT-5.5 with long-context Responses API traffic?
Is there anything about large Responses API input arrays, around 300+ items and 1.3–1.5 MB bodies, that can prevent meaningful cache reuse even when the serialized prefix is stable?
Can project/account settings, region, service tier, or routing affect GPT-5.5 prompt-cache availability?
Is the Auckland / ANZ Cloudflare 520 pattern known to interact with Responses API background or foreground modes?
Is there any additional field or state-management requirement, beyond prompt_cache_key, prompt_cache_retention, store, and assistant phase replay, that is necessary for GPT-5.5 cache behavior to work correctly?

I can provide response IDs, x-request-id values, Cloudflare Ray IDs, timestamps, request-shape summaries, and representative examples to OpenAI staff if useful.

I’m not trying to argue from billing alone here. I’m trying to understand why a workload with repeated very large stable prefixes is getting either no cache credit or only tiny sub-6K-token cache hits, while also running into Auckland/Cloudflare 520 instability when testing alternative Responses modes. That said, the estimated additional cost to my project and operations as a result of the sudden unprovoked cache failure mode exceeds ~$15k in only a couple of weeks as uncached excess fees compared to what I would have paid without this multi-headed service disruption.

glenn.haugen · June 16, 2026, 10:28am

This is my exact experience nowadays as well when using GPT-5.4.

slackermanz · June 16, 2026, 10:48am

Yeah, I should have clarified, it’s definitely happening with GPT 5.4 as well.

The Cloudflare 520s have been a persistent issue, but I suspect that’s regional infrastructure or something like that. It was June 10th where my already low cache hit rate went to absolute zero on both models.

_j · June 16, 2026, 11:04pm

USA: No significant change in GPT-5.5 on Responses (top, pink: cached input, bottom, purple: uncached input)

The challenge for OpenAI will then be fixing their localization and Cloudflare routing.

gpt-5.3-codex only looking better because it spews tiny tool calls instead of following the offered tool mechanics to combine and parallel its outputs. But also unchanged.

slackermanz · June 17, 2026, 12:45am

What’s your API setup for the 5.5? How much conformance with the native contract are you participating in?

It’s the responses API, right? And are you doing it with streaming or non-streaming or background polling? I’d really like to know what you can tell me about what you’re getting successes with.

slackermanz · June 17, 2026, 6:54am

A bit of an update: I was able to seemingly resolve the Cloudflare 520s by bouncing my traffic off a foreign VPS.

This has allowed me to achieve stable API responses, even when using foreground synchronous mode.

However, I still have not managed to achieve stable prompt cache hits, even when routing my context windows through OpenRouter, which is particularly surprising.

My top suspicion is that this is perhaps related to some change in how system prompts are handled. I’m preparing a test framework to pressure test this assumption.

edit/update:

Zero change in behaviour as a result of moving the top-level instructions field into the input body as the developer role.

_j · June 17, 2026, 8:07am

This is Responses API, both conversation ID and self-management doing most of the high-cost calls. It is using primarily tool-calling on large initial context. Non-streaming and not background, as the ratio of 5-15 iterative tool calls before an output doesn’t have much need for a scrolling screen response after the long work. I don’t send a prompt cache key, nor 24h, as the system prompt is enough to be unique for the app, and I’m not making enough parallel calls to hit rotation out of a server instance by other similar usage.

slackermanz · June 17, 2026, 8:14am

Interesting. I’ll try that combination.

Do you happen to know how often your cache prefix gets rewritten or updates itself?

I’ve managed to work one of my context windows into a scenario where it gets reliable cache hits, but it’s always against the exact same OpenAI-side cache prefix, which is not growing or increasing or changing.

And I found that if I historically mutate a message within the prefix, it results in a 0% cache hit, but it doesn’t write a new prefix and then start using that one. The only way to restore cache hits against that seemingly sticky or static prefix that they’re serving me is to revert the mutation that I make.

_j · June 17, 2026, 8:36am

There’s not really a “cache prefix”. As the cache hit process and routing is described, the first 256 tokens is hashed, and the hash determines same-server routing. The prompt cache key you can send is also included in this hashing, so it can serve to break the cache hit - a feature only useful after significant parallel usage.

The database lookup method of 24hr storage is not described other than it persists longer, but no other strategy or minimum length requirement beyond 1024 tokens (actually needing and delivering on closer to 1200) has been documented or expected. So ensure there is no changing of the bulk of the start of context (which OpenAI does on you anyway when they change an injected date).

BTW, right now: [usage] in/cached: 135229/133376 | out/reasoning: 433/154

I had a second turn with no cache hit until it started iterating also.

slackermanz · June 17, 2026, 9:40am

When you say “the bulk of the start context”, do you mean the entire body of the context window submission?

I have a lot of context management mechanisms at play that definitely mutate the middle and later parts of the body, but certainly nothing anywhere near the start, which is usually incredibly stable.

What I’ve discovered so far, though, is that even absolute end-of-context-window tail mutations, if they get rolled into the whatever it is that they’re matching the cache against, If those tail mutations contain volatile data at all, the result is the entire stored cache block that I would be getting cache reads from is voided because it’s impossible to match the life-changing data at the very tail.

So even if I’ve got two megabytes of text that is completely stable, if I have a cache write that includes some unique data that changes from turn to turn, that also must be preserved exactly, otherwise I get a 100% cache miss.

But if I modify my system to ensure that the ephemeral state is never injected at the tail, I get a few cache hits until a cache write happens again, and that usually includes this volatile state, and then the whole system is broken again until whatever sticky mechanism resolves itself.

So those 256 tokens as described, is that quite literally the very first 256 user content tokens of the entire context window, and it doesn’t care about anything else? Because so far that doesn’t match my observations.

Everything so far seems to be pointing towards no resilience against volatile tail materials. It’s worth noting that nothing about my architecture has changed. This used to work just fine.

I wonder if I fingerprint the first 85 or 95 percent of my context window and hash that and mix that in with the prompt cache key. At least when it comes to middle-body mutations, if I can detect that before sending the API call, It would knowingly void the cache write that I was previously using and hopefully request a new one while also automatically dropping the volatile tail content.

Then from there it would just be a game of cat and mouse to reverse engineer and figure out exactly when cache-write events happen as a result of tail growth, rather than volatile inner contents.

At this point I’d be quite willing to intentionally poison whatever it is that causes the fingerprinting when I know that there’s been a mutation, and ensure that I resume from a cache-write state that is knowingly clear of any tail pollution.

I’ve identified some additional body mutation triggers, but that doesn’t change the issue of the absolute tail of the context window being cached. Even though it contains volatile data, and that partial matches to a written cache state can’t be matched. It seems to be the entire context window exactly, or nothing at all.

_j · June 17, 2026, 10:21am

Checklist - things that are likely non-cache inputs:

Calls to different models
Calls with different service tier
Calls with different prompt cache key
Calls past expiry (5-60 minutes)
Calls with framework injections of text such as UUIDs
Prompt IDs with variables, varying prompt ID versions
Not passing and maintaining a full chat history
Varying or dropping encrypted reasoning, or “phase”, in output being returned
Responses with any kind of compaction

Possible: different localization routing, different organization or project, etc. OpenAI running different determinism fingerprint models on varying hardware vs 24hr retrieval, etc.

Then the big one: Your actual API call, instructions + input is simply non-varying, only adding new inputs to a record of 100% fidelity.

Just from and for inspiration, I asked my AI pal starting with C for some tooling, a start of inspecting past string sequences you are sending in logs or “live”. Then far more “demo” presentation than needed for a token encoder + integer list matcher when you run.

"""
token_cache_diff.py
-------------------
Compares two tiktoken-encoded integer sequences to find where their shared
prefix ends, and reports whether that prefix qualifies for OpenAI's prompt
caching discount (≥ 1024 tokens, counted in 128-token increments).

Typical use: encode your prompt at each API call and pass both encoded
sequences here to pinpoint where early content mutations break cache
eligibility between runs.
"""

from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Optional

try:
    import tiktoken  # only needed for the encode helper
except ImportError:
    tiktoken = None  # type: ignore


# ─────────────────────────────────────────────────────────────────────────────
# Result container
# ─────────────────────────────────────────────────────────────────────────────

@dataclass(frozen=True)
class TokenDiffResult:
    """Outcome of comparing two token-integer sequences."""

    matching_prefix_len: int
    """Number of tokens identical from index 0 up to (not including) the first break."""

    divergence_index: Optional[int]
    """Index of the first mismatched token, or None when one sequence is a
    clean prefix of the other (identical, extension, or truncation)."""

    divergence_type: str
    """
    'identical'  – sequences are byte-for-byte the same.
    'extension'  – candidate grew beyond reference with no mutations.
    'truncation' – candidate is shorter than reference with no mutations.
    'mutation'   – a token value differs at divergence_index.
    """

    divergent_tokens: Optional[tuple[int, int]]
    """(reference_token, candidate_token) at the divergence point, or None."""

    cache_eligible_len: int
    """Largest prefix length that qualifies for a caching discount.
    0 if the matching prefix is under the minimum threshold."""

    cache_tiers_hit: int
    """How many 128-token cache tiers are covered by cache_eligible_len."""


# ─────────────────────────────────────────────────────────────────────────────
# Core comparison
# ─────────────────────────────────────────────────────────────────────────────

def compare_token_sequences(
    reference: list[int],
    candidate: list[int],
    cache_min_tokens: int = 1024,
    cache_increment: int = 128,
) -> TokenDiffResult:
    """
    Compare two tiktoken integer sequences and report where they first diverge.

    A shared prefix is valid only when every token from index 0 up to (but not
    including) the first divergence is identical.  A sequence that is strictly
    longer with no mutations is treated as a clean extension, not a mutation.

    Args:
        reference:        The baseline / earlier token sequence.
        candidate:        The later token sequence being compared.
        cache_min_tokens: Minimum matching prefix for a caching discount (default 1024).
        cache_increment:  Cache-tier size in tokens (default 128).

    Returns:
        TokenDiffResult with the matching length, divergence position/type,
        and the largest cache-eligible prefix length.

    Examples:
        >>> compare_token_sequences([1, 2, 3], [1, 2, 3]).divergence_type
        'identical'
        >>> compare_token_sequences([1, 2, 3], [1, 2, 3, 4]).divergence_type
        'extension'
        >>> compare_token_sequences([1, 2, 3, 4], [1, 2, 3]).divergence_type
        'truncation'
        >>> r = compare_token_sequences([1, 2, 9, 4], [1, 2, 3, 4])
        >>> r.divergence_index, r.matching_prefix_len
        (2, 2)
    """
    min_len = min(len(reference), len(candidate))

    # Walk only the overlapping portion looking for the first mismatch.
    divergence_index: Optional[int] = None
    for i in range(min_len):
        if reference[i] != candidate[i]:
            divergence_index = i
            break

    # Matching prefix is everything before the break (or the full overlap).
    matching_prefix_len = divergence_index if divergence_index is not None else min_len

    # Classify the relationship between the two sequences.
    if divergence_index is not None:
        divergence_type = "mutation"
    elif len(reference) == len(candidate):
        divergence_type = "identical"
    elif len(candidate) > len(reference):
        divergence_type = "extension"
    else:
        divergence_type = "truncation"

    divergent_tokens: Optional[tuple[int, int]] = None
    if divergence_index is not None:
        divergent_tokens = (reference[divergence_index], candidate[divergence_index])

    # Largest prefix that lands on a cache-tier boundary.
    cache_eligible_len = 0
    cache_tiers_hit = 0
    if matching_prefix_len >= cache_min_tokens:
        tiers = (matching_prefix_len - cache_min_tokens) // cache_increment
        cache_eligible_len = cache_min_tokens + tiers * cache_increment
        cache_tiers_hit = tiers + 1  # the first tier counts as tier 1

    return TokenDiffResult(
        matching_prefix_len=matching_prefix_len,
        divergence_index=divergence_index,
        divergence_type=divergence_type,
        divergent_tokens=divergent_tokens,
        cache_eligible_len=cache_eligible_len,
        cache_tiers_hit=cache_tiers_hit,
    )


# ─────────────────────────────────────────────────────────────────────────────
# Convenience wrapper — encodes text first, then compares
# ─────────────────────────────────────────────────────────────────────────────

def compare_text_inputs(
    reference_text: str,
    candidate_text: str,
    model: str = "gpt-4o",
    cache_min_tokens: int = 1024,
    cache_increment: int = 128,
) -> TokenDiffResult:
    """
    Encode both strings with tiktoken and delegate to compare_token_sequences.

    Args:
        reference_text:   The earlier / baseline prompt string.
        candidate_text:   The later prompt string to compare.
        model:            The OpenAI model name used to select the tokeniser.
        cache_min_tokens: Minimum matching prefix for a caching discount.
        cache_increment:  Cache-tier size in tokens.

    Returns:
        TokenDiffResult (same as compare_token_sequences).

    Raises:
        ImportError: if tiktoken is not installed.
    """
    if tiktoken is None:
        raise ImportError("tiktoken is required: pip install tiktoken")

    enc = tiktoken.encoding_for_model(model)
    return compare_token_sequences(
        list(enc.encode(reference_text)),
        list(enc.encode(candidate_text)),
        cache_min_tokens=cache_min_tokens,
        cache_increment=cache_increment,
    )


# ─────────────────────────────────────────────────────────────────────────────
# Console display helpers
# ─────────────────────────────────────────────────────────────────────────────

_W = 72  # inner width of each box row (chars between the two ║ borders)


def _rule(char: str = "─") -> str:
    return char * _W


def _header(title: str) -> str:
    return (
        f"╔{_rule('═')}╗\n"
        f"║  {title:<{_W - 2}}║\n"
        f"╠{_rule('═')}╣"
    )


def _divider() -> str:
    return f"╠{_rule('═')}╣"


def _footer() -> str:
    return f"╚{_rule('═')}╝"


def _row(text: str = "") -> str:
    return f"║{text:<{_W}}║"


def _body(lines: list[str]) -> str:
    return "\n".join(_row(line) for line in lines)


def _print_box(title: str, sections: list[list[str]]) -> None:
    """Print a box with a title bar and one or more content sections."""
    print(_header(title))
    for i, section in enumerate(sections):
        if i:
            print(_divider())
        print(_body(section))
    print(_footer())


def _tier_bar(
    eligible_len: int,
    matched_len: int,
    cache_min: int = 1024,
    cache_inc: int = 128,
) -> str:
    """Compact tier bar: ▓ = covered tier, ░ = reachable but not yet crossed."""
    if matched_len < cache_min:
        return "n/a  (below minimum threshold)"
    max_tiers = (matched_len - cache_min) // cache_inc + 1
    hit_tiers = (eligible_len - cache_min) // cache_inc + 1
    bar = "▓" * hit_tiers + "░" * (max_tiers - hit_tiers)
    next_boundary = cache_min + hit_tiers * cache_inc
    tokens_to_next = next_boundary - matched_len
    suffix = f"  (+{tokens_to_next} to tier {hit_tiers + 1})" if tokens_to_next > 0 else ""
    return f"[{bar}]  {hit_tiers}/{max_tiers}{suffix}"


def _result_rows(
    result: TokenDiffResult,
    ref_len: int,
    cand_len: int,
    cache_min: int = 1024,
    cache_inc: int = 128,
) -> list[str]:
    """Build the content rows for a comparison result section inside a box."""
    rows: list[str] = []
    delta = cand_len - ref_len
    sign = "+" if delta >= 0 else ""

    rows.append(f"  Reference length  : {ref_len:,} tokens")
    rows.append(f"  Candidate length  : {cand_len:,} tokens  ({sign}{delta:,})")
    rows.append("")

    icon = {
        "identical": "≡", "extension": "→", "truncation": "←", "mutation": "✗"
    }.get(result.divergence_type, "?")
    rows.append(f"  Divergence type   : {icon}  {result.divergence_type}")
    rows.append(f"  Prefix matched    : {result.matching_prefix_len:,} tokens  (raw)")

    if result.divergence_index is not None:
        rt, ct = result.divergent_tokens  # type: ignore[misc]
        rows.append(
            f"  First break       : index {result.divergence_index:,}"
            f"  [ref={rt}  cand={ct}]"
        )

    rows.append("")

    if result.cache_eligible_len:
        rows.append(f"  Cache-eligible    : {result.cache_eligible_len:,} tokens")
        rows.append(
            "  Tier bar          : "
            + _tier_bar(result.cache_eligible_len, result.matching_prefix_len,
                        cache_min, cache_inc)
        )
        rows.append("")
        rows.append(f"  ✓  Valid cache prefix — {result.cache_tiers_hit} tier(s) covered")
    else:
        rows.append(f"  Cache-eligible    : 0  (need >= {cache_min:,} matching tokens)")
        short_by = cache_min - result.matching_prefix_len
        rows.append(
            f"  Raw prefix only   : {result.matching_prefix_len:,} tokens"
            + (f"  (short by {short_by:,})" if short_by > 0 else "")
        )
        rows.append("")
        rows.append("  ✗  No cache discount — prefix too short or mutated")

    return rows


# ─────────────────────────────────────────────────────────────────────────────
# Demo — simulated multi-turn chat context with caching diagnostics
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    SEED      = 42
    CACHE_MIN = 1024
    CACHE_INC = 128
    random.seed(SEED)

    # ── Build simulated token sequences ──────────────────────────────────────
    # Token IDs are random integers in the realistic GPT-4o tiktoken range.

    BASE_LEN   = 1_500  # system prompt + previous conversation context
    ROUND1_LEN = 210    # first new user message (Turn 1)
    ROUND2_LEN = 195    # second new user message (Turn 2)

    base   = [random.randint(1, 50_256) for _ in range(BASE_LEN)]
    round1 = [random.randint(1, 50_256) for _ in range(ROUND1_LEN)]
    round2 = [random.randint(1, 50_256) for _ in range(ROUND2_LEN)]

    seq_r0 = base                        # 1,500 — initial cached context
    seq_r1 = base + round1               # 1,710 — after Round 1
    seq_r2 = base + round1 + round2      # 1,905 — after Round 2

    # Branch: silently mutate one token deep inside the base context, then
    # re-append the same round1 and round2 suffixes.  Total length unchanged.
    MUTATION_IDX  = 47
    mutated_base  = base[:]
    mutated_base[MUTATION_IDX] = (mutated_base[MUTATION_IDX] + 999) % 50_256
    seq_branch    = mutated_base + round1 + round2  # 1,905 — index 47 is wrong

    # ── Intro ─────────────────────────────────────────────────────────────────
    print()
    _print_box(
        "  TOKEN CACHE PREFIX DIFF — MULTI-TURN DEMO",
        [[
            "  Simulated tiktoken integer sequences (no real model call needed).",
            "  Each turn compares the previous full prompt against the new one,",
            "  mirroring how you would call compare_token_sequences() in practice.",
            "",
            f"  Cache discount rule: prefix >= {CACHE_MIN:,} tokens, aligned to",
            f"  {CACHE_INC}-token tiers  (1024 -> 1152 -> 1280 -> 1408 -> ...)",
            "",
            "  Tier bar key:  ▓ = cache tier covered   ░ = tier within reach",
        ]],
    )
    print()

    # ── Turn 0: base context seeded ───────────────────────────────────────────
    _print_box(
        "  TURN 0  ·  Base Context Seeded  (seed=42)",
        [[
            f"  {BASE_LEN:,} tokens generated — system prompt + prior assistant turns",
            "  already present in the context window.",
            "",
            "  Stored as the cache reference.  No comparison yet.",
        ]],
    )
    print()

    # ── Turn 1: first user round ──────────────────────────────────────────────
    r1 = compare_token_sequences(seq_r0, seq_r1, CACHE_MIN, CACHE_INC)

    _print_box(
        "  TURN 1  ·  Round 1 User Input  (+210 tokens appended)",
        [
            [
                f"  {ROUND1_LEN} new user-message tokens appended to the base context.",
                "  ref = stored cache (Turn 0)    cand = new full prompt",
            ],
            _result_rows(r1, len(seq_r0), len(seq_r1), CACHE_MIN, CACHE_INC),
        ],
    )
    print()

    # ── Turn 2: second user round ─────────────────────────────────────────────
    r2 = compare_token_sequences(seq_r1, seq_r2, CACHE_MIN, CACHE_INC)

    tier_delta = r2.cache_tiers_hit - r1.cache_tiers_hit
    if tier_delta > 0:
        tier_note = (
            f"  ↑  +{tier_delta} tier(s) vs Turn 1 — grew past"
            f" {tier_delta} x {CACHE_INC}-token boundary(s)."
        )
    else:
        tier_note = "  —  No new cache tier boundary crossed since Turn 1."

    _print_box(
        "  TURN 2  ·  Round 2 User Input  (+195 tokens appended)",
        [
            [
                f"  {ROUND2_LEN} more tokens appended.  Context keeps growing cleanly.",
                "  ref = Turn 1 full prompt       cand = Turn 2 full prompt",
            ],
            _result_rows(r2, len(seq_r1), len(seq_r2), CACHE_MIN, CACHE_INC),
            [tier_note],
        ],
    )
    print()

    # ── Turn 3: branch / mutation ─────────────────────────────────────────────
    r3 = compare_token_sequences(seq_r2, seq_branch, CACHE_MIN, CACHE_INC)

    _print_box(
        "  TURN 3  ·  Branch — Early Mutation Detected",
        [
            [
                f"  Token at index {MUTATION_IDX} was silently changed inside the base context.",
                f"  Total length is unchanged ({len(seq_branch):,} tokens) — mutation is subtle.",
                "  ref = Turn 2 full prompt       cand = mutated branch",
            ],
            _result_rows(r3, len(seq_r2), len(seq_branch), CACHE_MIN, CACHE_INC),
            [
                f"  ⚠  Prefix breaks at index {r3.divergence_index}."
                f"  All {r2.cache_tiers_hit} previously earned tier(s) wiped.",
                "  Server must recompute the KV-cache from scratch.",
                "",
                "  Common causes of early mutation:",
                "    · Timestamp / request-ID injected into system prompt",
                "    · Dynamic fields (username, locale) placed before static content",
                "    · Tool-call results inserted ahead of the stable context block",
            ],
        ],
    )
    print()

slackermanz · June 17, 2026, 10:53am

I greatly appreciate the detailed response here and the effort put in. I’ll pass this to my agents and have them prepare comprehensive responses to each of these points and identify all of the possible mistakes we’re making.

That said, several of these I can answer myself directly:

Calls to different models:
Sometimes, but this can be confidently ruled out as a contributing factor, as I’ve not been varying the models involved while doing any testing, and generally I stick with GPT 5.5 in the exact same configuration.

Calls with different service tier:
Directly ruled out. I don’t modify this at any point.

Calls with different prompt cache key:
For the same context window, I have experimented with both stable prompt cache keys as well as salting them when I get several cache misses in a row. Currently, I’m running with it strictly stable, and it hasn’t materially changed anything.

Calls past expiry (5-60 minutes):
Current testing is being done with a 24-hour request and a prompt cache key, though my API calls are typically far faster than 5-60 minutes apart. They would usually be seconds to minutes apart. Even then, with this quick turnaround time, I still experience very quick regressions to 0% hit rates.

Calls with framework injections of text such as UUIDs:
There is volatile content injected as a prepended block above the most recent user role message only. This can range from things like the time of day down to the second, or recent file modifications and other volatile contents. From everything I’ve researched and looked into, tail volatility should not be a problem here. Yet, it seems to be the primary contributor.

It may not be as overt as constantly injecting random UUIDs that change turn to turn, but the fingerprinting of it would be effectively the same.

Prompt IDs with variables, varying prompt ID versions:
As far as I’m aware, I’m not using any system that matches this description. Is this one of the extended features of the Responses API? One of the walled garden features, perhaps?

Not passing and maintaining a full chat history:
As far as my context windows exist turn to turn, the verbatim contents is that they are complete right from the very first message. That is to say, I’m not doing any kind of truncation or sliding window context management that would drop a whole chunk at the start of the context window.

I have comprehensive diff-based fingerprinting of the at-wire-flight-time request payloads turn-to-turn that show constant stability of the request payload shapes.

Varying or dropping encrypted reasoning
The reasoning blocks are all entirely dropped. They never enter into the context window. The only content block types that I submit or deal with are text and very occasionally, image.

If the cache writes are happening based on the response before it is delivered to my application, including their expectation of the encrypted reasoning contents, then that may be a plausible explanation. But I have also attempted to test this, and it doesn’t seem to have any meaningful effect. It would be my expectation that they would do cache writes based on what I submit, not based on what they return.

phase in output being returned:
I manually checked my wire logs, and at no point am I replaying any phase field data.

Responses with any kind of compaction:
I have my own client-side compaction mechanism, which is indistinguishable from simply sending a different context window. So, I’m not using any API-provided black-box compaction. It is just simply a different payload after using it. It is infrequent and stable between API calls / submission turns.

Then the big one: Your actual API call, instructions + input is simply non-varying, only adding new inputs to a record of 100% fidelity.
Could you just clarify what you mean by this one? Because if this is a specification of the narrow requirements of getting a cache hit, then I’m absolutely not satisfying that. However, up until about a week and a half ago, there was no problem with the high mutability of my context windows.

That’s really the thing here; my my context windows have always been volatile but all of a sudden my my ability to hit cache at all has dropped to zero.

I’ll continue experimenting and investigating and drop updates when I make any progress.

slackermanz · June 17, 2026, 12:10pm

I’ve made good progress so far.

The first smaller win was identifying a previously unknown source of body mutation through my existing fingerprinting mechanisms. Though cleaning that up was just an improvement to the telemetry and the ability to diagnose this further itself.

That is to say, my application still heavily depends on and induces severe body content mutations.

What I do have is very promising experimental results that indicate that when cache writes do occur, they record the exact state including the volatile tail HUD contents I was discussing.

So when that induces a cache miss, as far as I can tell, that cache miss comes with a corresponding cache write, but because matching the volatile content is impossible, I can never get a cache hit in future while that HUD system is active.

The initial promising results are that if I deliberately halt any body content churn and disable the state readout HUD, I can induce a cache write, essentially rebasing the cache state.

It’s an ugly workaround that wasn’t necessary two weeks ago, but it at least shows some promise and some degree of controllability for high volatility context windows.

What hasn’t yet been determined is what conditions, if any, induce a tail-growth case-write event.

slackermanz · June 18, 2026, 6:36am

So far I’ve made significant modifications to the way my application handles the context windows.

It seems the invariant is that whenever a cache write occurs, it is for the full context window, including and right up to the very last byte of the tail content, no matter the volatility, and it refuses to deliver a cache hit on anything other than the whole complete and exact context window.

This divergence in expectations from the documentation as far as I’ve been able to understand it, and from the previous behaviour before about June 11th does seem to be roughly correlated with the migration or inclusion of Azure as an inference provider.

The best I’ve been able to achieve so far is a double-cache bust rewrite.

Essentially, when an unexpected cache miss occurs, I enter into maintenance mode, clean up the context window, disable any volatile readouts that would be appended at the tail, and then deliberately cache bust again.

This forces a new cache write in a clean state without any volatile contents, which I can then grow the tail of.

When the tail grows too large, it’s my intention to mutate a otherwise static hash placed in the very top of the context window …while also deliberately entering one of these maintenance windows.

As best I can understand, this is not how the cache mechanisms are supposed to work. Does anybody know if this behaviour is actually expected or documented? I thought partial prefix matching was the expected behaviour.

It’s been a very significant refactor effort, but it’s starting to stabilise.

Topic		Replies	Views
We need to talk about prompt caching Feedback prompt-caching , responses-api , chat-completions-api	1	845	October 25, 2025
Perceived Drop in GPT-5 Quality Over the Last Few Weeks Codex gpt-5-codex , gpt-5-5	32	1297	June 3, 2026
Cache not caching more than 1024 tokens (expected: increments of 128 tokens) Bugs prompt-caching	6	454	November 14, 2024
Structured Outputs not reliable with GPT-4o-mini and GPT-4o API structured-output	38	9734	January 23, 2025
Codex Rate Limits Discussion Thread Codex rate-limit	379	23658	June 17, 2026

Persistent 0% prompt cache hits on GPT-5.5 with Auckland NZ Cloudflare 520s complicating every workaround

Related topics