Using same prompt_cache_key in multiple parallel conversations

Hi, I have several parallel conversations running with a long common context at the beginning. I would like to set the same prompt_cache_key across all of these conversations. So essentially I start off with one conversation and fork it into separate conversations. Is this perfectly acceptable, or is there a risk of cross-contaminating the cached state across conversations?

Hi, let me try out this new AI thing I heard about recently to answer…

Yep—using the same prompt_cache_key across multiple parallel “forked” conversations is acceptable, and it won’t “cross‑contaminate” the conversations.

Why there’s no cross-contamination

OpenAI’s prompt caching is essentially caching compute for an identical prefix (think: the KV cache / intermediate attention state for the first N tokens of the prompt). If two requests have the same beginning (and meet the caching rules/thresholds), the service can reuse that cached prefix compute and then continue fresh from the point where the prompts diverge.

So:

  • Shared prefix + same routing key ⇒ higher chance of a cache hit for that shared prefix.
  • After the fork, each conversation’s unique continuation is computed independently.
  • There’s no shared “mutable conversation state” being updated that could leak from one fork to another.
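Concretely, a fork just repeats the identical prefix in each request. Here is a minimal sketch of that idea; the parameter name follows the Chat Completions request shape, while the model name, helper function, and prompt contents are placeholders I made up for illustration:

```python
# Sketch: forking one conversation into two branches that share a long
# common prefix and the same prompt_cache_key. Helper names are invented.

SHARED_PREFIX = [
    {"role": "system",
     "content": "You are a contract-review assistant. <imagine several thousand tokens of shared instructions and documents here>"},
]

def build_fork_request(branch_messages, cache_key="contract-review-v1"):
    """Build one request payload for a forked branch.

    Every fork repeats the byte-identical shared prefix and reuses the
    same prompt_cache_key, so the service can reuse the cached prefix
    compute and continue fresh where the branches diverge.
    """
    return {
        "model": "gpt-4.1",  # placeholder model name
        "messages": SHARED_PREFIX + branch_messages,
        "prompt_cache_key": cache_key,
    }

fork_a = build_fork_request([{"role": "user", "content": "Summarize clause 3."}])
fork_b = build_fork_request([{"role": "user", "content": "List the termination conditions."}])

# Both forks start identically and carry the same routing key; only the
# continuation after the fork point differs.
assert fork_a["messages"][0] == fork_b["messages"][0]
assert fork_a["prompt_cache_key"] == fork_b["prompt_cache_key"]
```

Nothing in either payload mutates shared state; the cache only shortcuts recomputing the identical prefix.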

When using the same key is a good idea

Use the same prompt_cache_key when:

  • You want many conversations to reuse a big common prefix (large system prompt, shared docs, shared tools block, etc.).
  • You’re intentionally forking from the same starting context and want maximum cache reuse.

When you might not want the same key

Even though it’s safe, it can be suboptimal for performance/cost patterns:

  1. You care more about per-user / per-conversation caching than shared-prefix caching. If your chats quickly diverge and are long-lived, a per-user (or per-conversation) prompt_cache_key can improve cache reuse for the later turns of each chat rather than optimizing only the shared intro.
  2. High parallel load + identical first ~256 tokens can concentrate routing. Routing is best-effort, and the first chunk of tokens + prompt_cache_key can influence which backend instance you land on for cache locality. If you send tons of parallel calls with identical prefixes and the same key, you may end up “hot-spotting” one cache host (and then getting spillover anyway). If you’re load testing or want distribution, vary the key (or add a harmless nonce early in the prompt).
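For point 2, one way to vary the key deterministically rather than randomly is to shard it; the shard count and naming scheme below are purely illustrative, not an official recommendation:

```python
# Sketch: spread many parallel, identically-prefixed calls across a few
# stable prompt_cache_key variants instead of hot-spotting one cache host.
import hashlib

def sharded_cache_key(base_key: str, request_id: str, shards: int = 4) -> str:
    """Map each request to one of `shards` stable key variants."""
    digest = hashlib.sha256(request_id.encode()).digest()
    shard = digest[0] % shards
    return f"{base_key}-shard{shard}"

# The same request_id always maps to the same shard (stable routing),
# while different ids spread across up to `shards` key variants.
assert sharded_cache_key("batch-job", "req-001") == sharded_cache_key("batch-job", "req-001")
```

The trade-off: each shard builds its own copy of the prefix cache, so fewer shards means better cache reuse, more shards means better load spreading.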

Practical recommendation for forked chats

  • Forked from same base, want maximum shared-prefix reuse: Use the same prompt_cache_key for all forks.
  • Forked but you want each branch to optimize its own subsequent turns: Use prompt_cache_key = base_id + branch_id (or just per conversation/session).
  • If you also track end users for policy / abuse detection: Set safety_identifier to a stable per-end-user ID (hashed), independent of caching.
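A tiny sketch of the second bullet’s base_id + branch_id scheme (all identifiers here are made up):

```python
# Sketch: per-branch keying for forked conversations.

def branch_cache_key(base_id: str, branch_id: str) -> str:
    # Keeping base_id first keeps a family resemblance between keys of
    # the same base, while each branch gets a distinct key so its own
    # later turns can be cached independently.
    return f"{base_id}:{branch_id}"

key = branch_cache_key("support-bot-v2", "branch-7")
assert key == "support-bot-v2:branch-7"
```

Use the bare base_id while the forks still share everything, and switch to branch keys once the branches grow long tails of their own.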

If you tell me whether your priority is (a) discounting the big shared system/context or (b) discounting each chat’s later turns, I can suggest a simple keying scheme.

“cross-contamination” is merely “over-utilizing the same server, until you are rotated out and have more cache misses”

Since the value you send is hashed along with a small portion of the start of the model input, and that hash is then used for routing to wherever a cache lives, you won’t want to change the value mid-stream - doing so would break finding the cache. Choose a strategy from the outset.

If you get busy with many clients, and have long conversations where all users receive the same system message of >1100 tokens, you may wish to make this value user-specific.


PS: The AI is only powered by my own prior explanations and answers.

Thanks a lot for the quick help. Just so I better understand: is the entire message generated by AI, or is it part AI, part human? Excuse me, I am new to this forum.

Actually, I don’t have real conversations; it is more of a tree-like structure I am building, where each message is a node forking out into separate child messages, and when no further children are needed, a child returns its output to the parent conversation, with the final output coming from the top node. This structure is needed to tackle complex problems where a divide-and-conquer strategy works. So I am planning to use the same prompt_cache_key across the entire tree, so children can retain the context from the parent without recomputing ($$) the initial messages.

The first quoted part with the grey background is output by AI, where I prompted essentially, "here’s all the information about prompt_cache_key; write a final guidance once and for all." Then I gave it several reference texts, and pasted what I got back here.


You will automatically receive the cache discount simply when the first part of the input context to an AI model is similar to one that has been run before, generating a cache that can match more than 1024 tokens (in practice, more like 1200 is needed). If all the inputs start the same, you are already implying that you want the same destination server that might hold a cache from running the input once before. There is no need to “retain context”.
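As a rough illustration of that threshold, here is a sketch using a crude 4-characters-per-token estimate (not a real tokenizer), with the 1024-token figure from the discussion above:

```python
# Sketch: will a shared prefix be long enough to qualify for caching?
# The chars/4 heuristic is a rough English-text estimate, not a tokenizer.

CACHE_THRESHOLD_TOKENS = 1024  # documented minimum; ~1200 observed in practice

def likely_cacheable(shared_prefix: str) -> bool:
    approx_tokens = len(shared_prefix) / 4
    return approx_tokens >= CACHE_THRESHOLD_TOKENS

assert likely_cacheable("x" * 8000)        # ~2000 estimated tokens
assert not likely_cacheable("short prompt")  # far below the threshold
```

If your shared system message falls short of the threshold, no keying scheme will produce a discount, because there is nothing cacheable to hit.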

A concern arises when you are sending many inputs in parallel where the “start” looks the same. After about 15 per minute (depending on run time), the server unit might be fully allocated, and then your request is necessarily run on another instance, building a separate cache somewhere else, one that grows even bigger. Your “chat” still has affinity for the first server, and can revert there on repeated turns, producing some misses.

Sending a prompt_cache_key gives you the opportunity to break the cache routing intentionally. Since your prompt_cache_key string is hashed with a smaller portion of the start of the input, and that hash is used by the routing algorithm to direct requests back to the same server that locally holds a cache, sending an original value makes your routing even more specific.
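Conceptually, you can picture that routing like the sketch below. To be clear, this is not OpenAI’s actual algorithm; it only illustrates the idea of hashing key + prefix start into a stable destination:

```python
# Conceptual sketch of cache-affinity routing (illustrative only).
import hashlib

def route(prompt_prefix: str, prompt_cache_key: str, n_servers: int = 8) -> int:
    """Pick a cache-server index from the start of the input plus the key."""
    sample = prompt_prefix[:256]          # only the start of the input matters
    material = prompt_cache_key + sample
    digest = hashlib.sha256(material.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_servers

# Same prefix + same key -> same destination, so the cache can be found again.
server = route("SYSTEM: big shared prompt...", "app-A")
assert server == route("SYSTEM: big shared prompt...", "app-A")
# A different key likely lands elsewhere, even with an identical prefix.
```

This is why changing the key mid-conversation is counterproductive: the hash changes, the routing changes, and the existing cache is stranded.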

For example, I have a “playground” where others can test using their own API key. Since I don’t know how highly utilized their organization is, I add my app’s own string as the prompt_cache_key value. That signals “run elsewhere from production, even if they are using their same prompts”. Chats in that playground then get a high rate of cache hits.

If it is just your personal work, and you are iterating sequentially on variations, you likely don’t need to think too hard and will get cache hits anyway. The AI output is essentially the same whether it is cheaper or not.