Initially I wanted the prefix to be cached once and then referenced by a cache_id/prompt_cache_key or something that would eliminate sending the prefix before every query, but that doesn’t seem possible.
I have a prefix greater than 1024 tokens that is used before all the user queries, and I was looking into how prompt caching would reduce input token costs. What impact does “prompt_cache_key” really have on cache hit rates? The latency and cached_tokens I observed with or without it were similar.
- Roughly the first 256 tokens of the input plus the prompt_cache_key field are hashed.
- That hash is used, on a best-effort basis, to route the request back to the same server instance.
- If it doesn’t match, load distribution is (or should be) encouraged instead.
- One cached host might handle, say, 15 calls per minute before you get rolled over to a different, non-cached instance anyway (a number that is completely arbitrary and probably fictional, because 15 “hi” calls to gpt-5-nano are not 15 “resolve this QED formula” calls to o3-pro).
What it can do is keep unrelated calls that merely start with the same tokens from consuming the cached server’s computation.
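Purely as a mental model (the hashing, the 256-token window, and the bucket count below are my assumptions, not OpenAI’s published implementation), the routing would look something like:

```python
import hashlib

def route_bucket(prompt: str, prompt_cache_key: str | None, n_servers: int = 64) -> int:
    """Toy model of prefix-based routing: hash roughly the first 256 "tokens"
    of the input together with prompt_cache_key, then map to a server bucket.
    The token count, hash choice, and bucket count are illustrative assumptions."""
    # Crude token approximation: whitespace split, first ~256 pieces.
    prefix = " ".join(prompt.split()[:256])
    digest = hashlib.sha256((prefix + "|" + (prompt_cache_key or "")).encode()).hexdigest()
    return int(digest, 16) % n_servers

# Same prefix + same key -> same bucket (cache reuse is possible);
# same prefix + different keys -> likely different buckets (load spreads out).
shared_prefix = "You are a helpful assistant. " * 100
print(route_bucket(shared_prefix + "user A question", "tenant-a"))
print(route_bucket(shared_prefix + "user B question", "tenant-b"))
```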
If you have many users with the same prefix, you have to decide which is more important to discount in common: the 1024 tokens of shared system prompt, or each user’s second and later chat turns, keeping cache expiry in mind.
I assume, though, that including previous_response_id makes the key somewhat superfluous in those cases, because the request would be routed to the system that handled the previous response anyway? So this is mainly for first requests.
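A minimal sketch of that split, assuming the Python SDK exposes prompt_cache_key and previous_response_id on responses.create as shown (the model name and key value are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for the 1024+ token prefix shared by every user of the app.
LONG_SHARED_SYSTEM_PROMPT = "You are the support agent for ... (1024+ tokens of instructions)"

# First turn: key the request so calls sharing this app prefix are more
# likely to land on the same cache.
first = client.responses.create(
    model="gpt-4.1-mini",                 # assumed model name
    instructions=LONG_SHARED_SYSTEM_PROMPT,
    input="First user question",
    prompt_cache_key="my-app-v1",         # hypothetical key value
)

# Later turns: previous_response_id carries the conversation forward,
# so the cache key matters mostly for that first request.
second = client.responses.create(
    model="gpt-4.1-mini",
    previous_response_id=first.id,
    input="Follow-up question",
    prompt_cache_key="my-app-v1",
)
print(second.usage.input_tokens_details.cached_tokens)
```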
A cache is local to an instance of a “server”, or whatever size of “compute unit” runs, shares, and retains the data.
Case against response ID helping cache
A KV context-window cache is specific to a particular AI model, whereas you can use previous_response_id with a different AI model in the next turn. This internal state can also be quite large. Therefore: it is unlikely the cache ever becomes part of an account database anywhere to be retrieved.
Others have a cache DB
Google lets you store your own persistent cache by deliberate effort and reference it by ID. Think: OpenAI’s “prompts presets”, but with a built-in discount.
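Roughly like this, if I have the google-genai SDK shape right (the exact config class names, TTL value, and model string are my assumptions):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the Gemini API key from the environment

# Create a persistent, explicitly addressable cache of the shared prefix.
cache = client.caches.create(
    model="gemini-1.5-flash-001",  # assumed model; explicit caching is per-model
    config=types.CreateCachedContentConfig(
        system_instruction="Long shared system prompt goes here...",
        ttl="3600s",
    ),
)

# Later requests reference the cache by its ID and get the cached-token discount.
resp = client.models.generate_content(
    model="gemini-1.5-flash-001",
    contents="A user question",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.text)
```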
When to employ the parameter
Besides your focus on caching the “app” vs caching “chats”, here is the big thing that I think is important:
You can penalize your performance if you have 256+ tokens in common with other calls (which is even just the size of the “vision safety message” that OpenAI injects), because everything gets sent to the same instance. Without differentiating the input, or indeed using this parameter to break the routing algorithm, the “hash” decides that same-server routing is preferable, even though you’d never realize a discount.
I haven’t tested this particular case of using prompt_cache_key to break common patterns and ensure load distribution (versus just making the calls with no thought about caching), by benchmarking batches of runs to see speed differences quantitatively. But it seems logical that varying the parameter keeps you from concentrating your computation in one place and bogging down inference, if OpenAI’s “rollover” is naive.
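Untested, but a sketch of what that “breaking the pattern” would look like, with an arbitrary pool size and a hypothetical key scheme (model name assumed):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, i: int, spread: int = 8) -> str:
    """Rotate prompt_cache_key across a small pool so a shared prefix does not
    route every call to one 'cached' instance. You trade cache hits for load
    distribution; whether that wins is exactly what would need benchmarking."""
    return client.responses.create(
        model="gpt-4.1-mini",                     # assumed model name
        input=prompt,
        prompt_cache_key=f"spread-{i % spread}",  # hypothetical key scheme
    ).output_text

for i in range(16):
    ask("Identical 300-token preamble ... plus a different question each time", i)
```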
I did note early on that inputs that could be cacheable, but were below the discount threshold, showed a similar performance penalty (or advantage?) to discount-sized calls. My inference then was not about routing, but that OpenAI might enjoy the cache optimization and simply not share the discount; I had reached a similar conclusion before the discount was even introduced.
“Prompts” in the Responses API are a shape that is exactly right for database-retrieved precomputation of a context-window cache, without the discount being delivered.
You choose a fixed model, and are encouraged to place all the tools and functions, the system message, and the multi-shot examples into one container.
The only breaking anti-pattern is prompt variables that can be placed early in the input. However, there is context-cache technology that can break input down into smaller modular blocks matched in a position-independent cache (web paper), and also prefill-decode disaggregation (PDF) for a cross-model cache.
That explains why the presets are made hard to create as disposable entities, and only in a UI.