Understanding `prompt_cache_key` and query efficiency

Server routing as described:

  • the first ~256 tokens of the prompt plus the `prompt_cache_key` field are hashed
  • that hash is used, on a best-effort basis, to route the request back to the same server instance
  • if the hash does not match, load distribution across instances is (or should be) encouraged instead
  • one cached host might handle about 15 calls per minute before you get rolled over to a different, non-cached instance anyway (a number that is essentially arbitrary and probably fictionalized, because 15 "hi" calls to gpt-5-nano are nothing like 15 "resolve this QED formula" calls to o3-pro)

What the key can do is keep unrelated calls that merely start with the same tokens from all landing on, and consuming, the same cache server's computation.
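Concretely, separating two workloads that share the same opening boilerplate might look like this. `prompt_cache_key` is the documented request field; the model name, system prompt, and key naming scheme are illustrative assumptions:

```python
# Two workloads share the same system prompt but serve different purposes;
# distinct prompt_cache_key values keep them from being routed onto (and
# competing for) the same cache server instance.
SYSTEM = "You are a helpful assistant for ExampleCorp."  # shared prefix

def build_request(user_text: str, workload: str) -> dict:
    """Build a Chat Completions request body with a per-workload cache key."""
    return {
        "model": "gpt-4o-mini",  # illustrative model name
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_text},
        ],
        # e.g. "examplecorp-chat" vs "examplecorp-classifier"
        "prompt_cache_key": f"examplecorp-{workload}",
    }
```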

If you have many users sharing the same prefix, you have to decide which discount matters more in aggregate, given cache expiry: the shared system prompt (caching starts at 1024 tokens) or each user's second turn of chat, whose prefix includes their own first turn.
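A rough way to compare the two cases, assuming cached input tokens are billed at half the normal rate (verify against current pricing) and ignoring the 128-token caching increments:

```python
def input_cost(total_tokens: int, cached_tokens: int,
               rate: float = 1.0, cached_discount: float = 0.5) -> float:
    """Input cost in arbitrary units: uncached tokens at full rate,
    cached tokens at rate * cached_discount (discount is an assumption)."""
    uncached = total_tokens - cached_tokens
    return uncached * rate + cached_tokens * rate * cached_discount

# Turn two of a chat: 1024-token system prompt + 1500 tokens of turn-one
# history + 200 new tokens. If only the shared system prompt is still
# cached, 1024 tokens get the discount; if this user's own turn-one
# prefix survived expiry, 2524 tokens do.
shared_only = input_cost(2724, 1024)   # 2212.0
full_prefix = input_cost(2724, 2524)   # 1462.0
```

The per-user prefix is clearly worth more on any single conversation; the catch is that it only pays off if that user returns before the cache expires, whereas the shared system prompt is kept warm by everyone's traffic.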
