I am using gpt-4o-mini:
Scenario 1: I have more than 5672 tokens (including system prompts, tools, and user messages). When I send the same request multiple times, I see 5432 cached tokens. I would expect all tokens to be cached, so why is a smaller number of tokens getting cached?
Scenario 2: If I remove a word from the prompt and send the request again, the cached token count drops by more than 1000 tokens compared to the previous request. Can someone explain how exactly this prompt caching works?
Scenario 3: Is there a sequence in which tokens are picked for caching?
For example, say I am sending the same tokens (2560 tokens of system prompts, 2200 tokens of tools, and 1500 tokens of user messages) in multiple requests. Can I expect all tokens to be cached? If a smaller number of tokens is cached, then which of the system prompt tokens, which of the tool tokens, and which of the user message tokens will be cached? What are the start/end points at which each of these types of tokens gets cached, and what are the limitations to caching all tokens? Please explain.
Hi @svelidanda and welcome to the community!
You may want to look at this thread for more details.
But in short: it’s more complicated than just thinking in terms of text/tokens; it comes down to how the KV cache works (part of the attention mechanism that GPT models are built on).
Thanks @platypus, but that is very limited, high-level documentation and it’s not clear exactly how it works. Is there a way we can see the cached tokens, to work out how it is working?
It is really unpredictable in the scenarios I listed.
I honestly don’t understand ALL of the linked thread on prompt caching either, but I provided it to Claude along with your questions…
Scenario 1: Fewer cached tokens than total
- Caching starts at 1024 tokens and increases in 128-token blocks
- Maximum cached tokens will be the largest multiple of 128 that fits your total
- Example: With 5672 total tokens, you’ll see 5432 cached (42 blocks of 128 + 1024)
Scenario 2: Large cache drop with small changes
- KV cache requires exact prefix matches
- Even small changes early in the sequence break prefix matching
- System must find next valid cache point after the change, potentially invalidating large sections
Scenario 3: Token sequence priority
- Caching prioritizes prefix of input (system prompts → tools → user messages)
- Put static content first (system prompts, tools) and variable content last (user messages)
- Cache typically lasts 5-10 minutes (up to 1 hour in off-peak times)
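To the question of whether you can see the cached tokens yourself: the usage block of the response reports them. A minimal sketch, assuming the official `openai` Python SDK (recent versions expose `prompt_tokens_details`; older ones may not):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "<your long, static system prompt>"},
        {"role": "user", "content": "<your user message>"},
    ],
)

usage = resp.usage
details = getattr(usage, "prompt_tokens_details", None)
print("prompt tokens:", usage.prompt_tokens)
# cached_tokens is how many of those prompt tokens were served from the cache
print("cached tokens:", details.cached_tokens if details else "not reported")
```

Sending the same request twice and comparing the two numbers is the most direct way to see the caching behaviour for yourself.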
Looks like Claude is not that great at math: 42 blocks of 128 + 1024 => 42 * 128 + 1024 = 6400, which doesn’t match the 5432 that was reported.
The most efficient communication is metacode.
Find your possible cache increment in the list below (generated by AI pattern-following):
1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4224, 4352, 4480, 4608, 4736, 4864, 4992, 5120, 5248, 5376, 5504, 5632, 5760, 5888, 6016, 6144, 6272, 6400, 6528, 6656, 6784, 6912, 7040, 7168, 7296, 7424, 7552, 7680, 7808, 7936, 8064…
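The same list can be generated in one line; a quick sketch based on the 1024-token floor plus 128-token increments described above:

```python
# Possible cached-token counts: a 1024-token minimum, growing in 128-token blocks.
increments = [1024 + 128 * k for k in range(56)]
print(increments)  # 1024, 1152, 1280, ..., 7936, 8064 (and so on)
```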
Yes I completely understand where you are coming from.
In Scenario 3, if you are sending 2560 tokens for the system prompt, 2200 for tools and 1500 for user messages, that’s 6260 in total, and in the most optimal case you can expect 1024 + (40 * 128) = 6144 cached tokens. That assumes you don’t make any changes in the subsequent calls. If you start making changes somewhere in the middle of the user messages part, then, because KV caching is causal (i.e. it only looks at preceding tokens to generate the subsequent ones), you can be sure that the second half of your user messages will no longer be served from the cache.
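Put another way, the best case is the largest value of the form 1024 + k * 128 that fits inside your total prompt length. A quick sketch of that calculation (`max_cacheable` is a helper name I made up, not an API):

```python
def max_cacheable(total_prompt_tokens: int) -> int:
    """Largest 1024 + k*128 that fits in the prompt; 0 if below the 1024-token minimum."""
    if total_prompt_tokens < 1024:
        return 0
    return 1024 + ((total_prompt_tokens - 1024) // 128) * 128

print(max_cacheable(6260))  # 6144 = 1024 + 40 * 128
print(max_cacheable(900))   # 0 -- below the caching threshold
```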
Thanks for the replies, it is a bit clearer now.
Regarding the token sequence priority in gpt-4o-mini:

1. Is the prefix of input tokens counted in the same order: system prompts → tools → user messages?
2. It looks like there are two separate caches being used, one for system prompts and one for tools? When changing the system or tools prompts in subsequent requests, I get a cached token count of zero if there are fewer than 1024 tokens, and 1024 plus 128-token increments if there are more than 1024 tokens.
There is no “priority”.
You construct exactly what you send with chat completions, and in what order: the system message, the tool definitions and response definitions that are placed automatically in the first system message, the length of the conversation history, the text you inject from RAG, the most recent user message…
The key is: reuse as much identical content as possible from the start, without any alteration.
For example, putting the time of the latest input into the system message is a great way to break cache matching.
Assistants is somewhat out of your control, but it should still hit any available cache if you are only adding messages.
There is no inspection of the contents to create individual caches. To the cache, what you send is simply a stream of tokens, and the start of the stream must be identical beyond 1024 tokens to reuse any context window calculation.
The tools you set are part of the first system message. Tool calls and returns are added to the conversation history at the end.
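For illustration, here is one way to lay out a request so the static parts stay at the front of the token stream and the changing parts (timestamp, latest question) stay at the end. A sketch assuming the `openai` Python SDK; `get_weather` is a made-up tool just to show where tool definitions sit:

```python
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

# Static prefix: identical on every request, so it can be matched by the cache.
SYSTEM_PROMPT = "You are a helpful assistant. <long, unchanging instructions here>"
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

def ask(history: list[dict], question: str):
    # Dynamic content (timestamp, latest question) goes at the END of the prompt,
    # so the cached prefix (system prompt + tools + earlier history) still matches.
    user_msg = f"[{datetime.now(timezone.utc).isoformat()}] {question}"
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": user_msg},
        ],
        tools=TOOLS,
    )
```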