Prompt cache: documented byte-prefix matching does not occur on gpt-5.4 / gpt-5.5 when trailing user content exceeds ~500 tokens

shubh49 · June 19, 2026, 1:47pm

We have a fully reproducible setup where OpenAI’s prompt cache fails to return a byte-identical prefix above the 1024-token minimum, on both `gpt-5.4` and `gpt-5.5`, whenever the trailing user message exceeds approximately 500 tokens.

Across six controlled tests we observe:

**C0 (cold write):** `cached_tokens=0` — expected.

- **C1 (identical re-send 30s later):** `cached_tokens ≈ 99.6%` of input — the cache **did** write successfully under our `prompt_cache_key`.

- **C2 (same system, *different* trailing user message, 30s later):** `cached_tokens=0` — even though the leading ~67K tokens of the prompt are byte-identical to C0/C1, and that exact prefix is sitting in the cache shard we just demonstrated on C1.

This contradicts the documented behavior:

> “Cache hits require an exact, repeated prefix match and works automatically for prompts containing 1024 tokens or more, with cache hits occurring in increments of 128 tokens.” — *OpenAI prompt-caching guide*

The Prompt Caching 201 cookbook is also explicit that there is no documented size limit on the trailing dynamic portion that affects cache validity.

## Reproducible test setup

### API shape

- Endpoint: `client.chat.completions.create` (also reproduced via `client.beta.chat.completions.parse` — same result)

- Messages: `[{“role”: “system”, “content”: SYSTEM}, {“role”: “user”, “content”: USER}]`

- `extra_body.prompt_cache_key`: explicit, same string across both calls in each test

- `extra_body.prompt_cache_retention`: `“24h”`

- `extra_body.reasoning_effort`: `“low”` (also tested with this param entirely omitted — same result)

- `max_completion_tokens`: 4000

### Inputs

- **System message:** 267,140 chars / ~66,786 tokens. Byte-identical between every pair of calls within a given test. A fresh nonce is prepended to guarantee no inherited cache from prior probes.

- **user_a:** 80,041 chars / ~20,010 tokens. Real PR diff (shard 0).

- **user_b:** 79,265 chars / ~19,816 tokens. Real PR diff (shard 1) — **different content**, same shape.

- **prompt_cache_key:** fresh UUID per test, identical between the calls within one test.

### Test pattern

Three calls per test, all sharing the same `prompt_cache_key`:

1. **C0** (T+0): `system + user_a` — expected MISS, this is the cold write.

2. **C1** (T+~42s): `system + user_a` IDENTICAL — expected HIT, proves the cache wrote.

3. **C2** (T+~85s): `system + user_b` — same system bytes, *different* trailing user. Per docs this should hit on the leading ~67K-token prefix. Observed: 0% cache.

## Results — six independent tests

|—|—|—|—|—|—|

| 1 | Production path (`beta.parse`, schema, retention=24h) | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 2 | Bare path (`.create`, no schema, no retention) | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 3 | 3-message split: `[system, static-user, dynamic-user]` | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 4 | `reasoning_effort` omitted entirely | gpt-5.5 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

| 5 | User-size sweep — see threshold table below | gpt-5.5 | — | — | varies by size |

| 6 | gpt-5.4 with identical setup as #2 | gpt-5.4 | 0 (0.0%) | 83,200 (99.6%) | **0 (0.0%)** |

In every test C1 returns 83,200 cached tokens — proof the cache wrote successfully under the shared `prompt_cache_key`. Yet C2 (querying the same key, byte-identical system bytes, different trailing user) returns 0.

### Threshold characterization (test #5**)**

|—|—|—|—|

| ~5 (earlier baseline) | ~67,000 | ~99.8% | HIT |

| 50 | 61,696 | 99.3% | HIT |

| **500** | **0** | **0.0%** | MISS |

| 2,000 | 0 | 0.0% | MISS |

| 10,000 | 0 | 0.0% | MISS |

| 20,000 | 0 | 0.0% | MISS |

The cliff sits between **50 and 500 tokens of trailing user content**. Below it, byte-prefix matching works as documented (~99% cache hit on the static prefix). Above it, cache hit drops to 0% — even though the static prefix is byte-identical and demonstrably resident in cache (proven by C1).

## What we have ruled out

Every variable we could vary across the six tests:

| Variable | Tested as | Effect on C2 cache hit |

|—|—|—|

| `response_format` (schema) | with vs without | none — both 0% |

| `beta.parse` vs `chat.completions.create` | both | none — both 0% |

| `prompt_cache_retention` (`“24h”`) | with vs without | none — both 0% |

| TokenBucket / client-side throttling | with vs without | none — both 0% |

| Message structure (2-msg vs 3-msg split with static user message in its own message) | both | none — both 0% |

| `reasoning_effort` (`“low”`) | with vs omitted | none — both 0% |

| Model (`gpt-5.5` vs `gpt-5.4`) | both | none — both 0% |

The only variable that matters is **the size of the trailing user message**. Above the ~500-token threshold, cache fails regardless of every other setting.

## Attachments available on request

- Six self-contained Python probe scripts that reproduce the behavior end to end. Each script is < 200 lines, runs in ~2–7 minutes, and prints the `cached_tokens` field directly from `response.usage.prompt_tokens_details`.

- Full request/response JSON payloads for any of the runs, including `x-request-id` headers if needed for server-side correlation.

- Logs from our production review pipeline showing 0% within-PR cache reuse across 30 days of operation.

Thank you.

dreamer_93 · June 24, 2026, 4:24am

Facing the same issue myself. The exact same series of prompts cache well on gpt 5.2. But for 5.4, only the system instructions are cached across calls. Each call builds upon the previous, only differences is the last few set of messages are different across calls.

@shubh49 were you able to solve this.

dreamer_93 · June 24, 2026, 3:21pm

Started working for me. Had to explicitly pass the prompt_cache_retention param. From the API reference I thought that caching should be enabled by default. But for GPT 5.4, only worked on passing the param. Was using the python sdk.

PaulBellow · June 24, 2026, 3:25pm

Thanks for coming back to let us know, @dreamer_93 … and welcome to the community! Hope you stick around. We’ve got a wealth of info and a lot of great people here.

shubh49 · June 25, 2026, 1:23pm

@dreamer_93 Had tried using PROMPT_CACHE_RETENTION, but it didn’t work. The issue seems that entire prompt is being considered for bytes matching instead of maximum common prefix bytes.

plopinou · June 27, 2026, 6:25am

Hitting this issue heavily while using GPT-5.5 with Cecli (fork of Aider). Even with no repo map and a confirmed (by tests) byte to byte stable prefix representing 95%+ of a request (mostly stable file contents), simply exchanging with the model will naturally add more than 500 tokens to the conversation and make the cache nearly always miss.

In the current state, the cache is completely useless except for tasks that only use a fixed prompt that do not have incrementally appended data.

Since 80% of my bill are from input tokens, I’m currently highly overpaying compared to models with working caches, to the point where I have to constantly micro-manage the context between each request because I know the cache is useless.

shubh49 · June 27, 2026, 7:04am

Update -
This issue seems to be isolated to chat.completions API only. I tried using Responses API and the partial prefix caching seems to be working in responses API.

Topic		Replies	Views
We need to talk about prompt caching Feedback prompt-caching , responses-api , chat-completions-api	1	923	October 25, 2025
Prompt caching returns cached_tokens=0 intermittently — same static prefix, same prompt_cache_key Bugs prompt-caching	2	102	July 10, 2026
Cache not caching more than 1024 tokens (expected: increments of 128 tokens) Bugs prompt-caching	6	468	November 14, 2024
Prompt Caching Not Working for GPT-5.4-Nano Bugs api	1	179	May 24, 2026
Prompt Caching seems not working even if long common prefix in the system prompt API prompt , gpt-5 , prompt-caching , gpt-41 , gpt-41-mini	3	618	September 23, 2025

Prompt cache: documented byte-prefix matching does not occur on gpt-5.4 / gpt-5.5 when trailing user content exceeds ~500 tokens

Related topics