Initially I wanted the prefix to be cached once and then referenced by a cache_id/prompt_cache_key or something that would eliminate sending the prefix before every query, but that doesn’t seem possible.
I have a prefix greater than 1024 tokens that is used before all the user queries, and I was looking into how prompt caching would reduce input token costs. What impact does “prompt_cache_key” really have on cache hit rates? The latency and cached_tokens I observed with or without it were similar.
- Roughly the first 256 tokens of the input plus the prompt_cache_key field are hashed.
- That hash is used, on a best-effort basis, to route the request back to the same server instance.
- If it doesn’t match, load distribution is (or should be) encouraged instead.
- One cached host might handle, say, 15 calls per minute before you get rolled over to a different, non-cached instance anyway (a number that is completely arbitrary and probably fictional, because 15 “hi” calls to gpt-5-nano are not 15 “resolve this QED formula” calls to o3-pro).
What it can do is keep unrelated calls that merely start with the same tokens from consuming the cached server’s computation.
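Purely as a mental model (the hashing, the 256-token window, and the bucket count below are my assumptions, not OpenAI’s published implementation), the routing would look something like:

```python
import hashlib

def route_bucket(prompt: str, prompt_cache_key: str | None, n_servers: int = 64) -> int:
    """Toy model of prefix-based routing: hash roughly the first 256 "tokens"
    of the input together with prompt_cache_key, then map to a server bucket.
    The token count, hash choice, and bucket count are illustrative assumptions."""
    # Crude token approximation: whitespace split, first ~256 pieces.
    prefix = " ".join(prompt.split()[:256])
    digest = hashlib.sha256((prefix + "|" + (prompt_cache_key or "")).encode()).hexdigest()
    return int(digest, 16) % n_servers

# Same prefix + same key -> same bucket (cache reuse is possible);
# same prefix + different keys -> likely different buckets (load spreads out).
shared_prefix = "You are a helpful assistant. " * 100
print(route_bucket(shared_prefix + "user A question", "tenant-a"))
print(route_bucket(shared_prefix + "user B question", "tenant-b"))
```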
If you have many users with the same prefix, you have to decide which is more important to discount in common: the 1024 tokens of shared system prompt, or each user’s second and later chat turns, keeping cache expiry in mind.
I assume, though, that including previous_response_id makes the key somewhat superfluous in those cases, because the request would be routed to the system that handled the previous response anyway? So this is mainly for first requests.
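A minimal sketch of that split, assuming the Python SDK exposes prompt_cache_key and previous_response_id on responses.create as shown (the model name and key value are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for the 1024+ token prefix shared by every user of the app.
LONG_SHARED_SYSTEM_PROMPT = "You are the support agent for ... (1024+ tokens of instructions)"

# First turn: key the request so calls sharing this app prefix are more
# likely to land on the same cache.
first = client.responses.create(
    model="gpt-4.1-mini",                 # assumed model name
    instructions=LONG_SHARED_SYSTEM_PROMPT,
    input="First user question",
    prompt_cache_key="my-app-v1",         # hypothetical key value
)

# Later turns: previous_response_id carries the conversation forward,
# so the cache key matters mostly for that first request.
second = client.responses.create(
    model="gpt-4.1-mini",
    previous_response_id=first.id,
    input="Follow-up question",
    prompt_cache_key="my-app-v1",
)
print(second.usage.input_tokens_details.cached_tokens)
```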
A cache is local to an instance of a “server”, or whatever size of “compute unit” runs, shares, and retains the data.
Case against response ID helping cache
A KV context-window cache is specific to a particular AI model, whereas you can use previous_response_id with a different AI model in the next turn. This internal state can also be quite large. Therefore: it is unlikely the cache ever becomes part of an account database anywhere to be retrieved.
Others have a cache DB
Google lets you store your own persistent cache by deliberate effort and reference it by ID. Think: OpenAI’s “prompts presets”, but with a built-in discount.
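Roughly like this, if I have the google-genai SDK shape right (the exact config class names, TTL value, and model string are my assumptions):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the Gemini API key from the environment

# Create a persistent, explicitly addressable cache of the shared prefix.
cache = client.caches.create(
    model="gemini-1.5-flash-001",  # assumed model; explicit caching is per-model
    config=types.CreateCachedContentConfig(
        system_instruction="Long shared system prompt goes here...",
        ttl="3600s",
    ),
)

# Later requests reference the cache by its ID and get the cached-token discount.
resp = client.models.generate_content(
    model="gemini-1.5-flash-001",
    contents="A user question",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.text)
```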
When to employ the parameter
Besides your focus on caching the “app” vs caching “chats”, here is the big thing that I think is important:
You can penalize your performance if you have 256+ tokens in common with other calls (which is even just the size of the “vision safety message” that OpenAI injects), because everything gets sent to the same instance. Without differentiating the input, or indeed using this parameter to break the routing algorithm, the “hash” decides that same-server routing is preferable, even though you’d never realize a discount.
I haven’t tested this particular case of using prompt_cache_key to break common patterns and ensure load distribution (versus just making the calls with no thought about caching), by benchmarking batches of runs to see speed differences quantitatively. But it seems logical that varying the parameter keeps you from concentrating your computation in one place and bogging down inference, if OpenAI’s “rollover” is naive.
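Untested, but a sketch of what that “breaking the pattern” would look like, with an arbitrary pool size and a hypothetical key scheme (model name assumed):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, i: int, spread: int = 8) -> str:
    """Rotate prompt_cache_key across a small pool so a shared prefix does not
    route every call to one 'cached' instance. You trade cache hits for load
    distribution; whether that wins is exactly what would need benchmarking."""
    return client.responses.create(
        model="gpt-4.1-mini",                     # assumed model name
        input=prompt,
        prompt_cache_key=f"spread-{i % spread}",  # hypothetical key scheme
    ).output_text

for i in range(16):
    ask("Identical 300-token preamble ... plus a different question each time", i)
```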
I did note early on that inputs that could be cacheable, but were below the discount threshold, showed a similar performance penalty (or advantage?) to discount-sized calls. My inference then was not about routing, but that OpenAI might enjoy the cache optimization and simply not share the discount; I had reached a similar conclusion before the discount was even introduced.
“Prompts” in the Responses API are a shape that is exactly right for database-retrieved precomputation of a context-window cache, without the discount being delivered.
You choose a fixed model, and are encouraged to place all the tools and functions, the system message, and the multi-shot examples into one container.
The only breaking anti-pattern is prompt variables that can be placed early in the input. However, there is context-cache technology that can break input down into smaller modular blocks matched in a position-independent cache (web paper), and also prefill-decode disaggregation (PDF) for a cross-model cache.
That explains why the presets are made hard to create as disposable entities, and only in a UI.