Prompt cache: documented byte-prefix matching does not occur on gpt-5.4 / gpt-5.5 when trailing user content exceeds ~500 tokens

We have a fully reproducible setup where OpenAI’s prompt cache fails to return a byte-identical prefix above the 1024-token minimum, on both `gpt-5.4` and `gpt-5.5`, whenever the trailing user message exceeds approximately 500 tokens.

Across six controlled tests we observe:

  • **C0 (cold write):** `cached_tokens=0` — expected.

- **C1 (identical re-send 30s later):** `cached_tokens ≈ 99.6%` of input — the cache **did** write successfully under our `prompt_cache_key`.

- **C2 (same system, *different* trailing user message, 30s later):** `cached_tokens=0` — even though the leading ~67K tokens of the prompt are byte-identical to C0/C1, and that exact prefix is sitting in the cache shard we just demonstrated on C1.

This contradicts the documented behavior:

> “Cache hits require an exact, repeated prefix match and works automatically for prompts containing 1024 tokens or more, with cache hits occurring in increments of 128 tokens.” — *OpenAI prompt-caching guide*

The Prompt Caching 201 cookbook is also explicit that there is no documented size limit on the trailing dynamic portion that affects cache validity.

## Reproducible test setup

### API shape

- Endpoint: `client.chat.completions.create` (also reproduced via `client.beta.chat.completions.parse` — same result)

- Messages: `[{“role”: “system”, “content”: SYSTEM}, {“role”: “user”, “content”: USER}]`

- `extra_body.prompt_cache_key`: explicit, same string across both calls in each test

- `extra_body.prompt_cache_retention`: `“24h”`

- `extra_body.reasoning_effort`: `“low”` (also tested with this param entirely omitted — same result)

- `max_completion_tokens`: 4000

### Inputs

- **System message:** 267,140 chars / ~66,786 tokens. Byte-identical between every pair of calls within a given test. A fresh nonce is prepended to guarantee no inherited cache from prior probes.

- **user_a:** 80,041 chars / ~20,010 tokens. Real PR diff (shard 0).

- **user_b:** 79,265 chars / ~19,816 tokens. Real PR diff (shard 1) — **different content**, same shape.

- **prompt_cache_key:** fresh UUID per test, identical between the calls within one test.

### Test pattern

Three calls per test, all sharing the same `prompt_cache_key`:

1. **C0** (T+0): `system + user_a` — expected MISS, this is the cold write.

2. **C1** (T+~42s): `system + user_a` IDENTICAL — expected HIT, proves the cache wrote.

3. **C2** (T+~85s): `system + user_b` — same system bytes, *different* trailing user. Per docs this should hit on the leading ~67K-token prefix. Observed: 0% cache.

## Results — six independent tests

| # | Variant | Model | C0 `cached` | C1 `cached` (identical) | **C2 `cached` (diff user)** |

|—|—|—|—|—|—|

| 1 | Production path (`beta.parse`, schema, retention=24h) | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 2 | Bare path (`.create`, no schema, no retention) | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 3 | 3-message split: `[system, static-user, dynamic-user]` | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 4 | `reasoning_effort` omitted entirely | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 5 | User-size sweep — see threshold table below | gpt-5.5 | — | — | varies by size |

| 6 | gpt-5.4 with identical setup as #2 | gpt-5.4 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

In every test C1 returns 83,200 cached tokens — proof the cache wrote successfully under the shared `prompt_cache_key`. Yet C2 (querying the same key, byte-identical system bytes, different trailing user) returns 0.

### Threshold characterization (test #5**)**

Fixed 67K-token system + varying-size diff-user pair (`user_a` vs `user_b` sliced to target token count):
| user message tokens | C2 `cached_tokens` | C2 % | Result |

|—|—|—|—|

| ~5 (earlier baseline) | ~67,000 | ~99.8% | :white_check_mark: HIT |

| 50 | 61,696 | 99.3% | :white_check_mark: HIT |

| **500** | **0** | **0.0%** | :cross_mark: MISS |

| 2,000 | 0 | 0.0% | :cross_mark: MISS |

| 10,000 | 0 | 0.0% | :cross_mark: MISS |

| 20,000 | 0 | 0.0% | :cross_mark: MISS |

The cliff sits between **50 and 500 tokens of trailing user content**. Below it, byte-prefix matching works as documented (~99% cache hit on the static prefix). Above it, cache hit drops to 0% — even though the static prefix is byte-identical and demonstrably resident in cache (proven by C1).

## What we have ruled out

Every variable we could vary across the six tests:

| Variable | Tested as | Effect on C2 cache hit |

|—|—|—|

| `response_format` (schema) | with vs without | none — both 0% |

| `beta.parse` vs `chat.completions.create` | both | none — both 0% |

| `prompt_cache_retention` (`“24h”`) | with vs without | none — both 0% |

| TokenBucket / client-side throttling | with vs without | none — both 0% |

| Message structure (2-msg vs 3-msg split with static user message in its own message) | both | none — both 0% |

| `reasoning_effort` (`“low”`) | with vs omitted | none — both 0% |

| Model (`gpt-5.5` vs `gpt-5.4`) | both | none — both 0% |

The only variable that matters is **the size of the trailing user message**. Above the ~500-token threshold, cache fails regardless of every other setting.

## Attachments available on request

- Six self-contained Python probe scripts that reproduce the behavior end to end. Each script is < 200 lines, runs in ~2–7 minutes, and prints the `cached_tokens` field directly from `response.usage.prompt_tokens_details`.

- Full request/response JSON payloads for any of the runs, including `x-request-id` headers if needed for server-side correlation.

- Logs from our production review pipeline showing 0% within-PR cache reuse across 30 days of operation.

Thank you.

1 Like