Consistent cache breaks with o4-mini and previous_response_id

Caching is based on input: it matches against prior inputs, not assistant outputs. So if you never send back a reasoning item (or reference one by its ID), you should be building on an input that doesn't have any breaks. Sending reasoning + assistant output as new input should behave much like sending the assistant output alone (with much faster context growth for dubious benefit, except that the reasoning continues to inform the AI).
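
To make the prefix-matching idea concrete, here's a toy sketch (counting whole items rather than tokens, and nothing like OpenAI's actual implementation): append-only growth keeps the entire prior input reusable, while rewriting anything earlier breaks the match at the first difference.

```python
# Toy sketch of prefix-based prompt caching -- illustrative only, not OpenAI's code.
# Assumption: the cache keys on the longest common prefix of the serialized input,
# counted here in whole items rather than tokens for simplicity.

def cached_prefix_len(previous_input: list[str], new_input: list[str]) -> int:
    """Count how many leading items the new request shares with a prior one."""
    n = 0
    for old, new in zip(previous_input, new_input):
        if old != new:
            break
        n += 1
    return n

turn_1 = ["system: be terse", "user: q1", "assistant: a1"]

# Append-only growth: the entire prior input remains a reusable cached prefix.
turn_2_append = turn_1 + ["user: q2"]
print(cached_prefix_len(turn_1, turn_2_append))   # 3 -> full reuse

# Rewriting an earlier item shifts everything after it and breaks the prefix.
turn_2_edited = ["system: be terse", "user: q1 (edited)", "assistant: a1", "user: q2"]
print(cached_prefix_len(turn_1, turn_2_edited))   # 1 -> only the system message matches
```

Whether reasoning items are present or absent doesn't matter in this picture, as long as you're consistent about it from turn to turn.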

What could produce a cache break is lookback meddling: someone's decision that "we'll only keep the last few reasoning outputs." But that should not behave as all-or-nothing the way your last log shows; eventually the chat would grow long enough that there'd be some prefix reuse.
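
Here's that intuition in the same toy model, assuming a hypothetical "keep only the last two reasoning items" policy: the break point moves forward each turn, but the untouched head of the chat keeps matching, so reuse should be partial rather than zero.

```python
# Sketch of why a "keep only the last N reasoning items" lookback policy should
# cause partial, not total, cache loss. Same toy model as above: items, not tokens.

def shared_prefix(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prune_reasoning(items: list[str], keep_last: int) -> list[str]:
    """Drop all but the last `keep_last` reasoning items; everything else stays put."""
    reasoning = [i for i, it in enumerate(items) if it.startswith("reasoning:")]
    drop = set(reasoning[:max(0, len(reasoning) - keep_last)])
    return [it for i, it in enumerate(items) if i not in drop]

history = ["system: be terse"]
for t in range(1, 6):
    history += [f"user: q{t}", f"reasoning: r{t}", f"assistant: a{t}"]

prev_sent = prune_reasoning(history[:-3], keep_last=2)   # input sent on the previous turn
this_sent = prune_reasoning(history, keep_last=2)        # input sent on this turn

print(len(prev_sent), len(this_sent), shared_prefix(prev_sent, this_sent))
# 11 13 6 -> the head of the chat up to the first newly pruned reasoning item still
# matches, so reuse grows with the chat instead of collapsing to zero.
```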


The best design would be for the response ID itself to point at a KV cache, not just at stored messages. The message list is immutable, which would facilitate that. However, one can guess that a model's state for a conversation is much larger than the context tokens it was computed from, making that impractical infrastructure.
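
A rough back-of-envelope (with made-up model dimensions, since o4-mini's architecture isn't published) shows why: the per-token KV state dwarfs the token text it was computed from.

```python
# Back-of-envelope: KV-cache state versus the prompt text it encodes.
# All model dimensions here are hypothetical, chosen only to show the arithmetic.

num_layers     = 48      # hypothetical transformer depth
num_kv_heads   = 8       # hypothetical grouped-query KV heads
head_dim       = 128     # hypothetical per-head dimension
bytes_per_elem = 2       # fp16/bf16
context_tokens = 50_000  # a long Responses API conversation

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
kv_total_gb   = kv_bytes_per_token * context_tokens / 1e9
text_total_mb = context_tokens * 4 / 1e6   # very roughly ~4 bytes of text per token

print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV state per token")
print(f"{kv_total_gb:.1f} GB of KV state vs ~{text_total_mb:.1f} MB of raw prompt text")
```

Keeping tens of gigabytes pinned per response ID, on the off chance that exact conversation is continued, is the impractical part; storing the tokens and re-prefilling on demand is far cheaper infrastructure.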

What’s a hit…

Caching is described as a "best effort to route you to be serviced by the same inference server," or similar language.

How would that be done (and how, with your thousands of calls a minute)?

Could it be that there's some quick hashing of inputs, finding that initial commonality, for routing? And that using a previous response ID in a Responses API call offers no "input" inspectable at that layer? One can only speculate in the dark, but such an architecture for servicing API calls might also be a reason for low cache hits.
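
To make that speculation concrete (a guess at an architecture, not anything documented): if routing were a cheap hash over the leading chunk of whatever `input` the request carries, full-transcript requests would route consistently, while a previous_response_id request, whose only visible input is the new user turn, has nothing stable to route on. Server names and the tiny prefix size below are invented.

```python
# Speculative sketch of "route by hashing the input prefix" -- not OpenAI's design.
import hashlib

SERVERS = [f"inference-{i:02d}" for i in range(16)]

def route_by_prefix(serialized_input: str, prefix_chars: int = 32) -> str:
    """Pick a server from a hash of the first `prefix_chars` characters of the input
    (a deliberately tiny prefix, just for the illustration)."""
    h = hashlib.sha256(serialized_input[:prefix_chars].encode()).digest()
    return SERVERS[int.from_bytes(h[:4], "big") % len(SERVERS)]

transcript = "system: be terse\nuser: q1\nassistant: a1\n"

# Full transcript resent each turn: the leading characters never change, so every
# turn lands on the same server (and whatever warm cache is sitting there).
turn_2 = transcript + "user: q2"
turn_3 = transcript + "user: q2\nassistant: a2\nuser: q3"
print(route_by_prefix(turn_2) == route_by_prefix(turn_3))   # True

# previous_response_id style: the only input visible at this layer is the new user
# turn, which changes every call, so there is no stable prefix to route on.
print(route_by_prefix("user: q2"), route_by_prefix("user: q3"))
```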

A response ID, you'd think, would instead be an easy path straight back to a cache waiting on an inference server.