I have a question. I’m developing with the Responses API and I’ve realized that when using function calling I spend a lot of input tokens just on those JSON tool definitions. How do I lower the token count? How can I take advantage of token caching? Does anyone know about this? Thanks :)
To benefit from caching, you need a minimum of 1024 input tokens, and subsequent calls must start with the same prefix.
An alternative is using the Batch API or flex processing to save costs.
https://platform.openai.com/docs/guides/prompt-caching#page-top
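Roughly, the idea is to keep everything that doesn’t change (instructions, tool definitions) identical and at the front of every request, and only vary the user input at the end. A minimal sketch with the Python SDK; the model name and tool schema here are just placeholders, not your actual setup:

```python
from openai import OpenAI

client = OpenAI()

# Keep these byte-for-byte identical across requests so they form a stable,
# cacheable prefix (caching only kicks in once the prefix exceeds ~1024 tokens).
STATIC_INSTRUCTIONS = "You are a helpful assistant for the Acme support desk."
STATIC_TOOLS = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
]

def ask(user_message: str):
    # Only the user input at the end changes between requests.
    return client.responses.create(
        model="gpt-4.1-mini",  # placeholder model
        instructions=STATIC_INSTRUCTIONS,
        tools=STATIC_TOOLS,
        input=user_message,
    )

resp = ask("Where is order 12345?")
print(resp.usage.input_tokens, resp.usage.input_tokens_details.cached_tokens)
```

The `cached_tokens` field in the usage block is how you can tell whether a given request actually got a cache hit.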
I was thinking about something, but I don’t know if it makes sense: if I save the response/conversation ID, could I send the function definitions only once and then stop sending them? Would they stay cached? Or do I always have to send them unchanged for them to be cached? thx
To receive a (non-guaranteed) discount, the entire initial input prefix needs to be the same. The cache is temporary, surviving only for a while after the most recent call, with a lifetime of roughly 10-60 minutes (someday I’ll pin down exactly how long you actually get).
If you change the initial system message instructions from “you are a helpful ai” to “you are a helpful ai with these user preferences”, then you have broken the caching.
That also applies to the list of internal tools and developer-provided functions that are appended to your system message. Switch tools on and off and you break caching.
The Responses API’s server-side conversation state does NOT offer any better caching discount, nor does it save you any input token cost. Your only benefit is that you don’t need to resend all the user’s chat messages over the network (but then you end up building another database of user chat tracking, of equal complexity, yourself anyway).
So you can increase your cache chances by building up the unchanging part of the input — the system message and tools — to beyond 1200 tokens or so, comfortably past the threshold. You could then get a 50%-75% discount on that frequently reused part.
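If you want a rough sense of whether your static part clears the threshold, you can estimate it locally. A sketch using tiktoken; the exact way the server serializes instructions and tool definitions isn’t documented, so treat this as an approximation only (the instruction text and tool here are placeholders):

```python
import json
import tiktoken

# o200k_base is the tokenizer used by recent GPT-4o / 4.1-era models; adjust if needed.
enc = tiktoken.get_encoding("o200k_base")

def estimate_prefix_tokens(instructions: str, tools: list[dict]) -> int:
    # Approximation only: the server's actual serialization of tools may differ.
    return len(enc.encode(instructions + json.dumps(tools)))

instructions = "You are a helpful assistant..."  # your real system/developer message
tools = [{"type": "function", "name": "get_order_status", "parameters": {"type": "object"}}]
print(estimate_prefix_tokens(instructions, tools), "estimated prefix tokens (need >= 1024 to cache)")
```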
It also means your own chat-length management can be cache-aware: only truncate or alter the history when the chat is stale, since it might take 50% truncation or more for the savings to beat the cache discount you’d be giving up.
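One way to make that concrete (a sketch only; the TTL is a guess based on the 10-60 minute figure above, and the message-list shape is assumed):

```python
import time

CACHE_TTL_SECONDS = 10 * 60  # conservative guess at how long a cached prefix survives

def maybe_trim(history: list[dict], last_call_ts: float, max_items: int = 40) -> list[dict]:
    # Only drop old turns when the cache has likely expired anyway; while the
    # conversation is "hot", keep the early part of the input untouched so the
    # next request can still match the cached prefix.
    stale = (time.time() - last_call_ts) > CACHE_TTL_SECONDS
    if stale and len(history) > max_items:
        return history[-max_items:]
    return history
```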
You always have to send the input again for it to “hit” the cache.
It’s a little confusing on the terminology side, because as I understand it there is no “cache” in the sense of a data-storage or “memory” layer in the middleware (and certainly not at the LLM level).
But there is some kind of compute cache. What is actually stored (and I don’t know how internal this is to the LLM versus OpenAI’s servers/middleware) is, somewhere in the process, the result of the prediction/weighting/attention computation over that input, which saves future token-processing time on the same input.
Thus:
- You send a prompt with greater than 1024 tokens.
- Depending on the model, endpoint, etc., that can generate a “compute/processing/reasoning cache”, which essentially stores the previously generated result of processing your input for that set of tokens.
- You continue the conversation, or otherwise duplicate that portion of the context window exactly (even a single extra whitespace or punctuation character will cause you to miss the cache, as I understand and have experienced it; it has to be identical).
So to review:
- The cache does not store data or have “memory” of your raw input data; you must include the input exactly as it originally occurred with every subsequent context-window call (the first call over 1024 tokens is what creates the cache). Reusing it that way is called “hitting the cache”.
- If you modify that portion of the context window (and yes, it’s sequential processing/storage), then you won’t hit the cache.
- If you only extend the context window in a sequential, linear fashion, then the previous portions of the context window will likely continue to hit the cache (does it keep “building” the cache? I’m not sure, but I think so). There’s a sketch of how to check this right after this list.
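Here’s the kind of check I mean: make two calls that share the same long prefix and look at the usage fields on the second one. The model name is a placeholder, and the padding text is only there to push the prefix past the 1024-token minimum:

```python
from openai import OpenAI

client = OpenAI()

# Padding just to push the shared prefix well past the 1024-token minimum.
LONG_PREFIX = "Reference document:\n" + ("lorem ipsum " * 600)

def run(question: str):
    return client.responses.create(
        model="gpt-4.1-mini",  # placeholder model
        input=LONG_PREFIX + "\n\nQuestion: " + question,
    )

first = run("Summarize the document.")
second = run("List three key points.")  # same prefix, different tail

# A non-zero cached_tokens on the second call means part of the prefix was
# served from cache (no guarantee it will be, even with an identical prefix).
print("first:", first.usage.input_tokens_details.cached_tokens)
print("second:", second.usage.input_tokens_details.cached_tokens)
```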
In my experience, the cache has been a non-issue; its overall effect is minimal unless you are using the highly expensive models with very large calls/context windows. If anything, I’ve actually experienced more detrimental effects than benefits from the cache.
For example I’ll:
(A) send a context window of documentation/data, plus
(B) the instructions/tasks for that data,
(C) receive a response I don’t like,
(D) modify the instructions (B) but not the data (A) and resend the conversation/context window [same (A), new (B)]; (A) is greater than 1024 tokens,
(E) oops, I hit the cache for (A), and now the model still doesn’t really give the results I want, because it’s reusing old compute from the cache of (A) and doesn’t seem to fully integrate the new instructions from (D),
(F) so I go in and add a couple of “..” or line breaks into (A) (sketched below),
(G) resend the same as (D) but with the additions from (F),
(H) and get a totally different response (usually much better) than I did in step (D), because I intentionally “missed the cache”, allowing the new instructions to be computed alongside the data instead of separately from the cached compute and then merged via some sort of further attention-layer integration…
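What I do in (F) boils down to something like this. It’s a sketch of my own workaround, not anything documented; whether a cache hit really changes output quality is just my anecdotal read, since the only documented effects are on cost and latency:

```python
import uuid

def cache_busted(data_block: str) -> str:
    # Any change near the start of the input shifts the whole prefix, so a
    # short unique line is enough to guarantee the cached prefix won't match.
    return f"[cache-bust {uuid.uuid4()}]\n{data_block}"
```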
Disclaimer: I could be wrong about all of this, but it’s the understanding I’ve developed through experience.
The cache may trigger if you reuse the same first 1024 tokens across requests. There is no guarantee that the cache will hit even if this condition is met, so it’s best not to rely on it.
If you expect your requests to be under 1024 tokens, or your first 1024 tokens to differ between requests, then you won’t receive cache discounts. But if you do receive a discount, there doesn’t seem to be a limit on how much of your prompt can be cached beyond the initial 1024 tokens.
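If you want to see how much of a given request was actually covered, the usage block on the response tells you. A small sketch; the field names are from the Responses API usage object:

```python
def cached_fraction(usage) -> float:
    # usage.input_tokens_details.cached_tokens counts the input tokens billed
    # at the cached rate; everything past the matched prefix is charged normally.
    cached = usage.input_tokens_details.cached_tokens
    return cached / usage.input_tokens if usage.input_tokens else 0.0

# e.g. cached_fraction(resp.usage) -> 0.0 on a cold call, > 0 after a cache hit
```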
Docs on prompt caching: https://platform.openai.com/docs/guides/prompt-caching