We just updated our prompt caching guide with details of how cache routing works!
We route prompts by org, hashing the first ~256 tokens, and spillover to more machines happens at ~15 RPM. If you’ve got many prompts with long shared prefixes, the user parameter can improve request bucketing and boost cache hits.
The most important takeaway: “user” can break caching, despite sitting outside of the input context.
So: if you run a platform with a common agent prompt used by many users, carefully built to bring it within a cacheable length, and you also send the user parameter for “user tracking for OpenAI safety” (documented, but never seeming to matter), you are breaking the very discount opportunity you engineered.
However, if the tokens that follow the cacheable start diverge by user, prefix-only routing cannot detect that - so “user” becomes a benefit, letting later turns of each user’s chat cache instead of only the initial prefix, supposing that user continues interacting within a short period.
A discount does not take effect any earlier either way - it is all about understanding this mechanism against your own traffic patterns: distributing between servers by your own indirect management instead of blind spillover after a rate threshold is hit. For reference, here is how the API documentation describes the identifier parameter:
A stable identifier used to help detect users of your application that may be
violating OpenAI's usage policies. The IDs should be a string that uniquely
identifies each user. We recommend hashing their username or email address, in
order to avoid sending us any identifying information.
Sending these parameters will require updating your openai SDK library to a recent version, then adapting their purpose to the request patterns you produce.
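A minimal sketch of sending them from the Python SDK, assuming a release recent enough to accept prompt_cache_key and safety_identifier as keyword arguments; the model name, key values, and prompt here are placeholders:

```python
import hashlib
from openai import OpenAI

client = OpenAI()

# The shared, cacheable prefix (imagine >1024 tokens of company knowledge).
SYSTEM_PROMPT = "You are the Acme support bot. ..."

# Hash your application's own user ID so no identifying info is sent.
user_email = "jane@example.com"
safety_id = hashlib.sha256(user_email.encode("utf-8")).hexdigest()

response = client.chat.completions.create(
    model="gpt-4.1-mini",               # placeholder model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    prompt_cache_key="support-bot-v1",  # influences cache routing/bucketing
    safety_identifier=safety_id,        # abuse detection only, per the quoted docs
)
print(response.choices[0].message.content)
```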
Use-case
You have a large cacheable system prompt >1024 tokens, used across many users. (Your support bot with lots of company knowledge, for example).
Solution
Use both new parameters for their described purpose, with prompt_cache_key kept in agreement with the first 256 tokens - one stable key per shared prefix - signaling a desire for best-effort routing to the same hash-indicated inference server.
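As a sketch of “in agreement with the first 256 tokens”: derive one stable key from the shared prefix itself, so every user of this prompt sends the identical key (the key format and the 1000-character slice are my own illustrative assumptions, roughly covering ~256 tokens):

```python
import hashlib

SYSTEM_PROMPT = "You are the Acme support bot. ..."  # the >1024-token shared prompt

# One key for everyone sharing this prefix; it only changes when the prompt does.
SHARED_CACHE_KEY = "support-bot-" + hashlib.sha256(
    SYSTEM_PROMPT[:1000].encode("utf-8")  # roughly the first ~256 tokens
).hexdigest()[:12]
```

Passing prompt_cache_key=SHARED_CACHE_KEY on every request keeps all users of the common prompt aimed at the same cache.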
Use-case
You still have a large cacheable system prompt >1024 tokens, used across many users (your support bot with lots of company knowledge, for example). But you find that at high volume, greater than 15 requests per minute, you get many cache misses on long chats, with the delivered discount covering less than the chat’s context.
Solution
Prefer a prompt_cache_key that is user-based, so each user’s conversation, and not just the common prompt, takes priority in routing and caching.
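A sketch of a per-user key under this pattern; user_id is whatever stable identifier your application already has, and hashing it mirrors the advice quoted above:

```python
import hashlib

def per_user_cache_key(user_id: str) -> str:
    # Bucket by user: the shared prefix plus that user's growing conversation
    # route together, so long chats keep hitting their own cache - at the cost
    # of users no longer pooling a cache for the common prompt.
    return "support-bot-" + hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]
```

Send prompt_cache_key=per_user_cache_key(user_id) (and, if you like, the same hash as the safety identifier) on every turn of that user’s chat.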
Use-case
You have a small system prompt or common tools, 256-1023 tokens, shared across users, after which chats diverge per user.
Solution
Break the routing-to-the-same-server that the prompt hash alone would indicate. You’d get no discount on only 500 tokens in common anyway, yet the same hash would still result. Provide a per-user prompt_cache_key to have your application’s traffic distributed while still encouraging per-user caching of chats, as sketched below.
This avoids the application’s common first 256 tokens doing the routing, saturating one cache server with misses, and rolling over.
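A compact sketch for this case (model, prompt, and helper name are placeholders of mine):

```python
import hashlib
from openai import OpenAI

client = OpenAI()

SMALL_SHARED_PROMPT = "You are a terse SQL helper. ..."  # ~500 tokens in practice

def ask(user_id: str, question: str):
    # Per-user key: the ~500-token common prefix is below the 1024-token caching
    # minimum and earns no discount anyway, so nothing is lost by spreading users
    # across servers - and each user's chat can still cache once it grows past 1024.
    key = "sql-helper-" + hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]
    return client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder model
        messages=[
            {"role": "system", "content": SMALL_SHARED_PROMPT},
            {"role": "user", "content": question},
        ],
        prompt_cache_key=key,
    )
```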
Use-case
You want to benchmark uncached results, simulating unconnected calls.
Solution
Hashing of prompt_cache_key and context does not seem to guarantee distribution across different inference servers.
Thus, add a small variation or cryptographic nonce within the first 128-token block that does not distract the AI (this also breaks any caching OpenAI would do, without delivering the discount), such as “Chat session id: e3f9\n”.
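A minimal sketch of that cache-busting nonce (the function name is mine; secrets.token_hex(2) yields a 4-character hex id like the e3f9 above):

```python
import secrets

def bust_cache(system_prompt: str) -> str:
    # Prepend a tiny random marker inside the first token block so the routing
    # hash and any prefix cache differ on every call, while staying short and
    # non-instructional so it does not distract the model.
    return f"Chat session id: {secrets.token_hex(2)}\n{system_prompt}"
```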
Use-case
You don’t know why sending a user ID would ever be useful, and have heard zero anecdotes of it ever being used to mitigate organization bans or to inform you of a bad user.