Hi @shure.alpha
Great set of questions.
Here’s the key: caching works by storing exact prefixes of your input. So, if your system message is consistent (say, the same instructions or context every time), you’re in luck. But it doesn’t cache the system prompt on its own: caching only kicks in once your entire input (system prompt + user input) reaches 1024 tokens or more. So, think of it as caching the start of your prompt, which will usually be the system prompt.
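If you want to sanity-check whether a prompt even crosses that 1024-token threshold, you can count tokens locally with tiktoken. This is just a rough sketch, and the encoding name is an assumption; pick whichever encoding matches your model’s tokenizer:

```python
# Rough sketch: count tokens to see if a prompt can reach the caching threshold.
# Assumes the "o200k_base" encoding matches your model's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

system_prompt = "...your long, static instructions..."
user_input = "...the part that changes per request..."

total_tokens = len(enc.encode(system_prompt)) + len(enc.encode(user_input))
print(f"~{total_tokens} prompt tokens")  # caching only applies at 1024+ tokens
```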
To maximize caching benefits, you’ll want to structure your prompt like this: put static content (like the system prompt) first, and dynamic content (like the user’s input) afterward. This makes it more likely for that static system prompt to be cached across multiple requests.
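As a rough sketch of what that ordering looks like in practice (the model name and client setup here are just placeholders, not a prescription):

```python
# Minimal sketch: static content first, dynamic content last,
# so repeated requests share the longest possible prefix.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = "You are a support assistant. <long, unchanging instructions>"

def answer(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": user_question},           # changes every request
        ],
    )
    return response.choices[0].message.content
```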
Caching stores the prompt prefix, meaning the first part of your entire prompt—everything leading up to the part that changes (like a user’s input). The system looks for identical prefixes from previous prompts to match and reuse them. So, it’s not just caching the system prompt by itself, but rather the combination of all messages (system, user, etc.) up to a certain token length.
LLMs do indeed predict tokens one by one (not characters, but tokens), but prompt caching doesn’t alter that process. When there’s a cache hit (i.e., a prefix match), the cached tokens aren’t reprocessed—they’re reused, which reduces latency. The model still generates the rest of the tokens based on the user input that follows. The caching just cuts down on the time it takes to process the repeated parts of your input.
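If you want to confirm you’re actually getting cache hits, the usage block on the response reports how many prompt tokens came from cache. The field names below reflect my understanding of the current response shape, so treat this as a sketch:

```python
# Sketch: check how many prompt tokens were served from the cache.
# Assumes usage.prompt_tokens_details.cached_tokens is present on the response.
def report_cache_usage(response) -> None:
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
```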
What’s stored is the prompt tokens: essentially, the tokenized version of your input (system + user + assistant messages). This includes the messages array (system prompts, including tool definitions, user inputs, and assistant responses) and any images included in the prompt. The stored prefixes stick around for about 5-10 minutes, and during off-peak times, maybe up to an hour.
tl;dr:
- Caching saves tokenized prefixes of prompts that are 1024 tokens or longer.
- It’s useful when you have repetitive content (like the same system prompt) across multiple requests.
- It helps reduce processing time, but the actual output tokens are always generated fresh each time.