A cache is local to an instance of a “server”, or whatever size of “compute unit” runs inference and retains its own data.
Case against response ID helping cache
A KV context-window cache is specific to a particular AI model, whereas previous_response_id can be reused with a different AI model on the next turn. That internal state could also be quite large. Therefore it is unlikely the KV cache ever becomes part of an account database anywhere to be retrieved.
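A minimal illustration of that argument (no network call; field names follow OpenAI’s Responses API, and the response ID is a made-up placeholder): the second turn can name a different model while chaining from the first response, so whatever state the ID retrieves must be model-agnostic conversation items, not a per-model KV cache.

```python
# Sketch only: the request bodies for two chained Responses API turns.
# The "resp_abc123" ID is a fake placeholder standing in for turn 1's ID.
turn1_body = {
    "model": "gpt-4o",
    "input": "Summarize how KV caching works.",
}
turn2_body = {
    "model": "gpt-4o-mini",                 # a *different* model next turn
    "previous_response_id": "resp_abc123",  # placeholder ID from turn 1
    "input": "Now shorten that summary.",
}

# The chained turn targets another model, which is exactly why the stored
# state can't simply be the first model's KV cache.
assert turn1_body["model"] != turn2_body["model"]
```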
Others have a cache DB
Google lets you create your own persistent cache by deliberate effort, and reference it by ID. Think: OpenAI’s “prompt presets”, but with a built-in discount.
When to employ the parameter
Besides your focus on caching the “app” vs. caching “chats”, the big thing that I think is important:
- You can penalize your own performance if your calls share 256+ tokens of common prefix with other calls (even just the size of the “vision safety message” that OpenAI injects), because everything then gets sent to the same instance. Without differentiation, either by changing the input or by using this parameter to break the routing algorithm, the prefix “hash” determines that same-server routing is preferable, even when you’d never realize a discount.
I haven’t benchmarked this particular case (using prompt_cache_key to break up common prefixes and ensure load distribution, versus making the calls with no thought to caching) to quantify speed differences across batches of runs, but it seems logical that varying the parameter keeps your computation from concentrating in one place and bogging down inference, if OpenAI’s “rollover” is naive.
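The distribution idea above can be sketched as follows. This is untested and the helper name, shard count, and bucketing scheme are my own assumptions, not anything OpenAI documents: derive a stable prompt_cache_key per user bucket so that otherwise-identical prompt prefixes are routed to different cache instances instead of piling onto one.

```python
import hashlib

def spread_key(user_id: str, shards: int = 8) -> str:
    """Hypothetical helper: a stable prompt_cache_key per user bucket,
    so calls sharing an identical prefix don't all hash to one instance."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % shards
    return f"bucket-{bucket}"

# The same user always maps to the same shard (preserving their own cache
# hits across turns), while different users spread across shards:
assert spread_key("alice") == spread_key("alice")
distinct = {spread_key(f"user-{i}") for i in range(100)}
assert len(distinct) > 1

# Usage would be passing it alongside the request, e.g.:
# body = {"model": "...", "input": prompt,
#         "prompt_cache_key": spread_key(user_id)}
```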
I did note early on that inputs that could be cacheable, but were below the discount threshold, showed a similar performance penalty (or advantage?) to discount-sized calls. My inference then was not about routing, but that OpenAI might be doing cache optimization and simply not sharing the discount; I had reached a similar conclusion before the discount was even introduced.