I am using gpt-4o-mini:
Scenario 1: I have more than 5672 tokens (including system prompts, tools, and user messages). When I send the same request multiple times, I see 5432 cached tokens. I would expect all tokens to be cached, so why is a smaller number of tokens getting cached?
Scenario 2: If I remove a word from the prompt and send the request again, the cached token count drops by more than 1000 tokens compared to the previous request. Can someone explain how exactly this prompt caching works?
Scenario 3: Is there a sequence in which tokens are picked for caching?
For example, say I am sending the same tokens (2560 tokens of system prompts, 2200 tokens of tools, and 1500 tokens of user messages) in multiple requests. Can I expect all tokens to be cached? If a smaller number of tokens is cached, then which of the system prompt tokens, which of the tool tokens, and which of the user message tokens will be cached? What are the start/end points at which each of these types of tokens gets cached, and what are the limitations to caching all tokens? Please explain.
Hi @svelidanda and welcome to the community!
You may want to look at this thread for more details.
But in short: it’s more complicated than just thinking in terms of text/tokens; it comes down to how the KV cache works (part of the attention mechanism that GPT models are built on).
Thanks @platypus, but that is very limited, high-level documentation and it’s not clear exactly how it works. Is there a way we can see the cached tokens, to work out how it is working?
It is really unpredictable in the scenarios I listed.
I honestly don’t understand ALL of the linked thread on prompt caching either, but I provided it to Claude along with your questions…
Scenario 1: Fewer cached tokens than total
- Caching starts at 1024 tokens and increases in 128-token blocks
- Maximum cached tokens will be the largest multiple of 128 that fits your total
- Example: With 5672 total tokens, you’ll see 5432 cached (42 blocks of 128 + 1024)
Scenario 2: Large cache drop with small changes
- KV cache requires exact prefix matches
- Even small changes early in the sequence break prefix matching
- System must find next valid cache point after the change, potentially invalidating large sections
Scenario 3: Token sequence priority
- Caching prioritizes prefix of input (system prompts → tools → user messages)
- Put static content first (system prompts, tools) and variable content last (user messages)
- Cache typically lasts 5-10 minutes (up to 1 hour in off-peak times)
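To the question of whether you can see the cached tokens yourself: the usage block of the response reports them. A minimal sketch, assuming the official `openai` Python SDK (recent versions expose `prompt_tokens_details`; older ones may not):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "<your long, static system prompt>"},
        {"role": "user", "content": "<your user message>"},
    ],
)

usage = resp.usage
details = getattr(usage, "prompt_tokens_details", None)
print("prompt tokens:", usage.prompt_tokens)
# cached_tokens is how many of those prompt tokens were served from the cache
print("cached tokens:", details.cached_tokens if details else "not reported")
```

Sending the same request twice and comparing the two numbers is the most direct way to see the caching behaviour for yourself.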
Looks like Claude is not that great at math: 42 blocks of 128 + 1024 => 42 * 128 + 1024 = 6400, which doesn’t match the 5432 that was reported.
The most efficient communication is metacode.
Find your possible cache increment in the list below (generated by AI pattern-following):
1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4224, 4352, 4480, 4608, 4736, 4864, 4992, 5120, 5248, 5376, 5504, 5632, 5760, 5888, 6016, 6144, 6272, 6400, 6528, 6656, 6784, 6912, 7040, 7168, 7296, 7424, 7552, 7680, 7808, 7936, 8064…
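The same list can be generated in one line; a quick sketch based on the 1024-token floor plus 128-token increments described above:

```python
# Possible cached-token counts: a 1024-token minimum, growing in 128-token blocks.
increments = [1024 + 128 * k for k in range(56)]
print(increments)  # 1024, 1152, 1280, ..., 7936, 8064 (and so on)
```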
Yes I completely understand where you are coming from.
In Scenario 3, if you are sending 2560 tokens for the system prompt, 2200 for tools and 1500 for user messages, that’s 6260 in total, and in the most optimal case you can expect 1024 + (40 * 128) = 6144 cached tokens. That assumes you don’t make any changes in the subsequent calls. If you start making changes somewhere in the middle of the user messages part, then, because KV caching is causal (i.e. it only looks at preceding tokens to generate the subsequent ones), you can be sure that the second half of your user messages will no longer be served from the cache.
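Put another way, the best case is the largest value of the form 1024 + k * 128 that fits inside your total prompt length. A quick sketch of that calculation (`max_cacheable` is a helper name I made up, not an API):

```python
def max_cacheable(total_prompt_tokens: int) -> int:
    """Largest 1024 + k*128 that fits in the prompt; 0 if below the 1024-token minimum."""
    if total_prompt_tokens < 1024:
        return 0
    return 1024 + ((total_prompt_tokens - 1024) // 128) * 128

print(max_cacheable(6260))  # 6144 = 1024 + 40 * 128
print(max_cacheable(900))   # 0 -- below the caching threshold
```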
Thanks for the replies, it is a bit clearer now.
Regarding the token sequence priority in gpt-4o-mini:

1. Is the prefix of input tokens counted in the same order: system prompts → tools → user messages?
2. It looks like there are two separate caches being used, one for system prompts and one for tools? When changing the system or tools prompts in subsequent requests, I get a cached token count of zero if there are fewer than 1024 tokens, and 1024 plus 128-token increments if there are more than 1024 tokens.
There is no “priority”.
You construct exactly what you send with chat completions, and in what order: the system message, the tool definitions and response definitions that are placed automatically in the first system message, the length of the conversation history, the text you inject from RAG, the most recent user message…
The key is: reuse as much identical content as possible from the start, without any alteration.
For example, putting the time of the latest input into the system message is a great way to break cache matching.
Assistants is somewhat out of your control, but it should still hit any available cache if you are only adding messages.
There is no inspection of the contents to create individual caches. To the cache, what you send is simply a stream of tokens, and the start of the stream must be identical beyond 1024 tokens to reuse any context window calculation.
The tools you set are part of the first system message. Tool calls and returns are added to the conversation history at the end.
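For illustration, here is one way to lay out a request so the static parts stay at the front of the token stream and the changing parts (timestamp, latest question) stay at the end. A sketch assuming the `openai` Python SDK; `get_weather` is a made-up tool just to show where tool definitions sit:

```python
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

# Static prefix: identical on every request, so it can be matched by the cache.
SYSTEM_PROMPT = "You are a helpful assistant. <long, unchanging instructions here>"
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

def ask(history: list[dict], question: str):
    # Dynamic content (timestamp, latest question) goes at the END of the prompt,
    # so the cached prefix (system prompt + tools + earlier history) still matches.
    user_msg = f"[{datetime.now(timezone.utc).isoformat()}] {question}"
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": user_msg},
        ],
        tools=TOOLS,
    )
```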