We’ve seen some inconsistent caching behavior and wanted to ask a few questions to better understand it, especially regarding tool caching. We’ve combed through past posts and haven’t found answers to these.
-
When sending both a prompt and a tools list, which is cached first: the prompt or the tools list? Is there any way to influence this? For example, it would be great to cache the system prompt first (it never changes), the tools list second (most of it doesn’t change, but the last tools sometimes vary), and the user prompt last (it always changes).
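For concreteness, here is roughly the request shape I mean (Python SDK; the model name, system prompt, and tool definitions are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Never changes between requests: would ideally be cached first.
SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."

# Mostly stable: everything but the last few entries is identical
# across requests; the final tool occasionally varies.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    # ... more static tools, then an occasionally-varying one ...
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        # Always changes: would ideally come last in cache order.
        {"role": "user", "content": "What's the weather in Paris?"},
    ],
    tools=TOOLS,
)
```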
-
Anthropic and Google allow more fine-grained control over what is cached, via explicit cache breakpoints in the prompt (e.g., Anthropic’s cache_control markers). Is this on OpenAI’s roadmap?
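For reference, this is the kind of explicit control I mean, using Anthropic’s cache_control breakpoint (illustrative values; everything up to and including the marked block gets cached):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Long static system prompt ...",
            # Breakpoint: the prefix up to and including this
            # block is written to / read from the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Always-changing user prompt"}],
)
```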
-
The prompt_cache_key, if I am reading the caching documentation correctly, is prepended to the prompt (the key comes first, then the prompt). So routing is based on the first 256 tokens of prompt_cache_key + prompt (or prompt_cache_key + tools, depending on the answer to question 1 above), yes?
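In other words, my mental model of the routing step is something like this (pure speculation on my part, just to make the question concrete):

```python
import hashlib

def pick_cache_machine(prompt_cache_key: str, prompt: str, num_machines: int) -> int:
    """Toy model: route to a machine by hashing prompt_cache_key
    plus (roughly) the first 256 tokens of the prompt."""
    # Crude stand-in for "first 256 tokens": ~4 chars per token.
    prefix = prompt_cache_key + prompt[: 256 * 4]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines
```
-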
Cache hits come in increments of 256 tokens, starting at 1024. Does this mean that the 256-token prefix is first used to route to a machine, and that machine stores hashes of every prompt seen during the last 5-10 minutes at 256-token increments? The cache is then queried for the hash of the first 1024 tokens; if that hits, it’s queried for the hash of the first 1280 tokens, and so on until it misses, and the longest hit is used? If not, how does it ensure the longest match (at 1024 + n*256 tokens) gets used?
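A toy sketch of the lookup I’m describing, just to pin down the question (entirely hypothetical, not claiming this is the real implementation):

```python
def longest_cached_prefix(tokens: list[int], stored_hashes: set[int]) -> int:
    """Probe the cache at 1024 tokens, then in +256 increments,
    stopping at the first miss and keeping the longest hit."""
    best, length = 0, 1024
    while length <= len(tokens):
        if hash(tuple(tokens[:length])) not in stored_hashes:
            break  # first miss ends the scan
        best = length
        length += 256
    return best  # tokens served from cache; 0 means no cache hit
```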
-
Per the documentation, routing always goes to a machine based on the hash of the first 256 tokens. Does this literally mean that a single machine stores the cache for all prefixes starting with those 256 tokens? Is the cache not shared at any level between pools of machines?
Thanks!!!