If I want to cache only the system prompt, how is it possible to do this by only providing the system prompt as input?
If we are using a transformer model, the input should consist of all the characters, not just the system prompt. Therefore, I find it hard to understand how meaningful output can be generated when only the system prompt is provided as input.
In this context, what does caching mean? Specifically, what exactly is being saved during the caching process?
Additionally, considering that LLMs predict one character at a time recursively, how does the concept of caching apply in this scenario? What exactly is being stored in the cache during this process?
I would appreciate any insights or clarifications on this matter. Thank you!
Here’s the key: caching works by storing exact prefixes of your input. So, if your system message is consistent (let’s say it’s the same instructions or context), you’re in luck. But it won’t cache only the system prompt alone. Caching kicks in when your entire input (system prompt + user input) reaches 1024 tokens or more. So, think of it as caching the start of your prompt – which will likely be the system prompt initially.
To maximize caching benefits, you’ll want to structure your prompt like this: put static content (like the system prompt) first, and dynamic content (like the user’s input) afterward. This makes it more likely for that static system prompt to be cached across multiple requests.
Caching stores the prompt prefix, meaning the first part of your entire prompt—everything leading up to the part that changes (like a user’s input). The system looks for identical prefixes from previous prompts to match and reuse them. So, it’s not just caching the system prompt by itself, but rather the combination of all messages (system, user, etc.) up to a certain token length.
LLMs do indeed predict tokens one by one (not characters, but tokens), but prompt caching doesn’t alter that process. When there’s a cache hit (i.e., a prefix match), the cached tokens aren’t reprocessed—they’re reused, which reduces latency. The model still generates the rest of the tokens based on the user input that follows. The caching just cuts down on the time it takes to process the repeated parts of your input.
What’s stored is the prompt tokens: essentially, the tokenized version of your input (system + user + assistant messages). This includes the messages array (system prompt, tool definitions, user inputs, assistant responses) and any images included in the prompt. The stored prefixes stick around for about 5-10 minutes, and during off-peak times, maybe up to an hour.
tl;dr:
Caching saves tokenized prefixes of prompts that are 1024 tokens or longer.
It’s useful when you have repetitive content (like the same system prompt) across multiple requests.
It helps reduce processing time, but the actual output tokens are always generated fresh each time.
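To make the "static first, dynamic last" structure concrete, here is a rough sketch using the OpenAI Python SDK (the model name and the padded system prompt are just placeholders, and the `usage.prompt_tokens_details.cached_tokens` field is read defensively because its name or presence may vary across SDK versions):

```python
# Rough sketch (not official sample code): keep the long, static system prompt
# first and the per-request user input last, so the static prefix can be cached.
from openai import OpenAI

client = OpenAI()

# Padded so that the static prefix alone exceeds the 1024-token minimum.
LONG_SYSTEM_PROMPT = "You are a support assistant. " + "<detailed static instructions> " * 300

def ask(user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": LONG_SYSTEM_PROMPT},  # static prefix -> cacheable
            {"role": "user", "content": user_input},             # dynamic suffix -> recomputed
        ],
    )
    details = getattr(response.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None) if details else None
    print(f"prompt_tokens={response.usage.prompt_tokens}, cached_tokens={cached}")
    return response.choices[0].message.content

ask("Where is my order?")           # first call: typically no cached tokens yet
ask("How do I reset my password?")  # second call soon after: the prefix may be cached
```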
Thank you for your answer.
I still do not understand.

In a transformer model, suppose we cache only the system prompt from the input "system prompt + user input A". If a new request comes in as "system prompt + user input B", the input to the model is the whole sequence "system prompt + user input B". Since "system prompt + user input A" was processed as a single input, it seems impossible to effectively reuse only the system-prompt part of "system prompt + user input A".

What I mean to convey is that, with caching in a transformer model, the computational cost for the computer does not change, if my understanding is correct.
Based on @sps’s response, my understanding is as follows:
When we have a combined input of (System Prompt + user input A), and the total token count reaches or exceeds 1024 tokens, the server-side will begin to cache any components that remain consistent throughout the conversation.
This assumes that user A maintains a consistent dialogue, meaning that the system prompt is sent with every new input, as illustrated below:
Example of Inputs:
SEND: System Prompt + userA_input(1)
RESPONSE: gptResponse(1)
message_history = System Prompt + userA_input(1) + gptResponse(1)
As the conversation history grows, the inputs become increasingly lengthy. It is reasonable to conclude that the server will analyse the repeated elements within this input and begin caching them. Therefore, as the system prompt and prior user inputs are repeated in subsequent messages, the server-side caching is likely to minimize processing costs for users.
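To illustrate that growing prefix, here is a rough sketch of a multi-turn loop (again assuming the openai Python SDK and a placeholder model name); every request resends the full history, so everything before the newest user message is an exact, cache-friendly prefix:

```python
# Sketch of a conversation loop: the full history is resent on every request,
# so the system prompt plus all earlier turns form an identical, growing prefix.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "<long, static instructions, identical on every request>"
message_history = [{"role": "system", "content": SYSTEM_PROMPT}]

def chat_turn(user_input: str) -> str:
    message_history.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",        # placeholder model name
        messages=message_history,   # prefix = system prompt + all earlier turns
    )
    reply = response.choices[0].message.content
    message_history.append({"role": "assistant", "content": reply})
    # On the next turn, the whole current history is an exact prefix of the new
    # request, which is precisely what server-side prompt caching can reuse.
    return reply

chat_turn("userA_input(1)")
chat_turn("userA_input(2)")
```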
I think @sps’s and @drb_kd’s explanations are very clear. As long as your system prompt remains consistent between requests, it will be cached. In fact, not only the system prompt but also the user input can be cached. Please keep in mind that your system prompt + input should be 1024 tokens or more.
Just experiment yourself using Postman or similar tools. You’ll get a response similar to this:
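If you prefer a script to Postman, a minimal sketch with Python’s requests library looks roughly like this (the endpoint is the standard chat completions URL; the exact shape of the usage block comes from the API reference and may change):

```python
# Minimal "Postman-style" experiment (sketch). Requires OPENAI_API_KEY to be set.
import os
import requests

payload = {
    "model": "gpt-4o-mini",  # placeholder model name
    "messages": [
        {"role": "system", "content": "<the same long system prompt every time>"},
        {"role": "user", "content": "Hello!"},
    ],
}
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
    timeout=60,
)
# The usage block reports how much of the prompt was served from cache,
# e.g. usage.prompt_tokens_details.cached_tokens in current API responses.
print(resp.json()["usage"])
```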
As I understand it, caching does not affect the final output; it only stores computed results to speed up responses and reduce cost. This prevents the system from recomputing sections of the input prompt that remain unchanged.
Caching occurs for prompts that contain 1024 tokens or more, cache hits occur in increments of 128 tokens, and the computed results are stored, minimizing unnecessary recomputation.
Caching Details:
Caching is available for prompts that contain 1024 tokens or more. Cache hits occur in increments of 128 tokens, so the number of cached tokens for a request will always follow this sequence: 1024, 1152, 1280, 1408, and so on, depending on the prompt’s length.
In such a scenario, the system will not recompute the previously computed (cached) section of the prompt, but it will recompute each section that follows the first change. Additionally, after at most one hour, it will recompute the cached section as well.
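Under those stated rules (a 1024-token minimum, then 128-token increments), the cached portion for a given matching prefix length can be estimated with a small helper like this (a sketch of the rule as described above, not official behaviour):

```python
def estimated_cached_tokens(matching_prefix_tokens: int) -> int:
    """Rough illustration of the stated rule: nothing below 1024 tokens,
    then the cached amount grows in 128-token increments."""
    if matching_prefix_tokens < 1024:
        return 0
    return 1024 + ((matching_prefix_tokens - 1024) // 128) * 128

print(estimated_cached_tokens(1000))  # 0
print(estimated_cached_tokens(1200))  # 1152
print(estimated_cached_tokens(1500))  # 1408
```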
Transformer-based language models, such as OpenAI’s GPT series, are powerful tools in natural language processing that can generate coherent and contextually relevant text. These models process and generate text by predicting one token (a piece of text, such as a word or a subword) at a time, based on the tokens that have come before. This process is known as autoregressive decoding. A key component in making this process efficient is the use of a Key-Value (KV) cache, which stores intermediate computations that can be reused to avoid redundant calculations.
1. Tokens and Tokenization
Before a language model can process text, the text must be converted into a format the model can understand:
Tokenization: This is the process of breaking down text into smaller units called tokens. Tokens can be words, subwords, or even individual characters, depending on the tokenization scheme used.
Example: The sentence “Hello, world!” might be tokenized into the tokens ["Hello", ",", " world", "!"].
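To see tokenization in practice, a quick sketch with the tiktoken package (which exposes the tokenizers used by OpenAI models) might look like this; the encoding name is one common choice, not the only one:

```python
# Sketch: inspect how a string is split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common encoding; others exist
tokens = enc.encode("Hello, world!")
print(tokens)                             # integer token IDs (exact values depend on the encoding)
print([enc.decode([t]) for t in tokens])  # the corresponding text pieces, e.g. ['Hello', ',', ' world', '!']
```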
2. Transformer Architecture Overview
Transformers are built on layers that process tokens to understand and generate language:
Embedding Layer: Converts each token into a numerical vector representation, known as an embedding.
Positional Encoding: Adds information about the position of each token in the sequence to the embeddings, since transformers process tokens in parallel and need positional information to understand the sequence order.
Transformer Blocks: Consist of:
Multi-Head Self-Attention Mechanism
Feed-Forward Neural Network
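As a didactic sketch of the first two steps (embedding lookup and sinusoidal positional encoding) in plain NumPy, with made-up dimensions and a random vocabulary:

```python
# Sketch: turn token IDs into position-aware input vectors for the transformer blocks.
import numpy as np

vocab_size, d_model, seq_len = 1000, 64, 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))

def embed(token_ids: np.ndarray) -> np.ndarray:
    return embedding_table[token_ids]           # (seq_len, d_model)

def positional_encoding(length: int, dim: int) -> np.ndarray:
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (length, dim)

token_ids = rng.integers(0, vocab_size, size=seq_len)
x = embed(token_ids) + positional_encoding(seq_len, d_model)  # input to the transformer blocks
```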
3. Self-Attention Mechanism
The self-attention mechanism allows the model to weigh the importance of different tokens relative to each other:
Queries (Q), Keys (K), and Values (V) are all derived from the embeddings of the input tokens through linear transformations.
Computation Steps:
Compute Queries, Keys, and Values:
For each token embedding \( x \), compute:
\( Q = W_q x \)
\( K = W_k x \)
\( V = W_v x \)
Where \( W_q \), \( W_k \), \( W_v \) are learned weight matrices.
Compute Attention Scores:
Calculate the dot product between the query of the current token and the keys of all tokens:
\[ \text{Attention Scores} = Q \cdot K^\top \]
Apply Scaling and Softmax:
Scale the attention scores and apply the softmax function to get attention weights:
\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) \]
\( d_k \) is the dimension of the key vectors.
Compute Weighted Sum of Values:
Multiply the attention weights by the values to get the output for the token:
\[ \text{Output} = \text{Attention Weights} \cdot V \]
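The computation steps above can be written out for a single attention head in a few lines of NumPy (a didactic sketch; real implementations are batched, multi-headed, and masked):

```python
# Sketch: scaled dot-product self-attention for one head.
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k)."""
    Q = x @ W_q                                     # queries
    K = x @ W_k                                     # keys
    V = x @ W_v                                     # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)              # (seq_len, d_k)
```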
4. Autoregressive Decoding Process
In the context of language generation, autoregressive decoding refers to the process where the model generates text one token at a time, with each new token generation conditioned on all previously generated tokens.
Process Overview:
Initialization:
Start with a sequence of input tokens (which could be empty for text generation from scratch).
Token Processing:
For each position in the sequence, compute embeddings, apply positional encoding, and process through transformer blocks.
Prediction:
Use the output of the transformer blocks to predict the next token by applying a linear layer followed by a softmax function to get probabilities over the vocabulary.
Iteration:
Append the predicted token to the sequence and repeat the process to generate subsequent tokens.
Causal Masking:
During training and inference, a causal mask is applied in the self-attention mechanism to prevent the model from “seeing” future tokens. This ensures that the prediction for a token at position \( t \) only depends on tokens at positions \( \leq t \).
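A stripped-down sketch of that loop, with a stand-in function in place of a real transformer forward pass (the function name and vocabulary size are just illustrative):

```python
# Sketch of the autoregressive loop: each step feeds the whole sequence so far
# back into the model and appends the predicted next token.
import numpy as np

vocab_size = 50
rng = np.random.default_rng(0)

def toy_next_token_logits(token_ids: list[int]) -> np.ndarray:
    """Placeholder for a transformer forward pass over the sequence so far.
    In a real model, causal masking guarantees position t only attends to <= t."""
    return rng.normal(size=vocab_size)

def generate(prompt_ids: list[int], max_new_tokens: int = 5) -> list[int]:
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(sequence)  # forward pass over all tokens so far
        next_id = int(np.argmax(logits))          # greedy decoding
        sequence.append(next_id)                  # condition the next step on it
    return sequence

print(generate([1, 2, 3]))
```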
5. Key-Value (KV) Cache
When generating text token by token, recomputing the keys and values for all previous tokens at each step is inefficient. The KV cache addresses this inefficiency by storing the keys and values computed for each token, allowing them to be reused in subsequent steps.
Purpose of the KV Cache:
Avoid Redundant Computations: By caching keys and values, the model doesn’t need to recompute them for previous tokens when generating each new token.
How It Works:
Initial Token:
Compute the queries, keys, and values for the first token.
Store the keys and values in the KV cache.
Subsequent Tokens:
For each new token, compute its query, key, and value.
Append the new key and value to the KV cache.
Use the query of the current token and the keys and values from the cache to compute attention outputs.
Implementation Details:
KV Cache Structure:
Typically stored as tensors with dimensions:
Keys: [Number of Layers, Number of Heads, Sequence Length, Key Dimension]
Values: [Number of Layers, Number of Heads, Sequence Length, Value Dimension]
Updating the Cache:
At each time step, the new key and value are added to the cache at the position corresponding to the current sequence length.
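Here is a minimal sketch of such a cache for a single head and layer in NumPy (shapes simplified relative to the tensor layout above; the class and function names are just illustrative):

```python
# Sketch: keep keys/values for all previous tokens and only compute K/V for the new token.
import numpy as np

class KVCache:
    def __init__(self, d_k: int, d_v: int):
        self.keys = np.empty((0, d_k))    # (cached_len, d_k)
        self.values = np.empty((0, d_v))  # (cached_len, d_v)

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        self.keys = np.vstack([self.keys, k_new[None, :]])
        self.values = np.vstack([self.values, v_new[None, :]])

def attend_with_cache(x_new, cache: KVCache, W_q, W_k, W_v):
    """Process ONE new token: compute its Q/K/V, extend the cache, and attend
    over all cached keys/values instead of recomputing them."""
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    cache.append(k, v)
    d_k = cache.keys.shape[-1]
    scores = cache.keys @ q / np.sqrt(d_k)   # (cached_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over cached positions
    return weights @ cache.values            # attention output for the new token

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 8, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
cache = KVCache(d_k, d_v)
for _ in range(4):                           # four decoding steps
    out = attend_with_cache(rng.normal(size=d_model), cache, W_q, W_k, W_v)
```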
6. Capturing and Reusing Computation State
The KV cache effectively captures the state of the model’s computations up to the current point in the sequence. This state can be reused when the initial context (the sequence of tokens provided to the model) remains the same across different runs.
Reusing Computation for Identical Contexts:
Scenario: When generating multiple continuations of the same initial text prompt.
Benefit: By reusing the KV cache for the shared initial context, the model saves computational resources by not recomputing the keys and values for those tokens.
Example:
First Run:
Provide an initial prompt: “Once upon a time”.
Compute and cache the keys and values for these tokens.
Subsequent Runs:
Use the same initial prompt.
Load the previously saved KV cache.
Start generating new tokens without recomputing the initial context.
Efficiency Gains:
Reduced Latency: Speeds up inference since the model skips computations for the shared context.
Resource Optimization: Saves computational power and memory bandwidth.
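A sketch of that reuse pattern, continuing the NumPy style above: the keys and values for the shared prefix are computed once and copied into each new run, so only the continuation tokens need fresh computation (embeddings are faked with random vectors here):

```python
# Sketch: reuse the computation state for an identical prefix across runs.
import copy
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
W_k, W_v = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))

# First run: compute and store the keys/values for the shared prefix,
# e.g. the tokens of "Once upon a time".
prefix_embeddings = rng.normal(size=(6, d_model))
prefix_cache = {"keys": prefix_embeddings @ W_k, "values": prefix_embeddings @ W_v}

def new_run(continuation_embeddings: np.ndarray) -> dict:
    """Subsequent runs: copy the prefix cache instead of recomputing it, and
    only append keys/values for the tokens that differ between runs."""
    cache = copy.deepcopy(prefix_cache)
    cache["keys"] = np.vstack([cache["keys"], continuation_embeddings @ W_k])
    cache["values"] = np.vstack([cache["values"], continuation_embeddings @ W_v])
    return cache

run_a = new_run(rng.normal(size=(3, d_model)))  # continuation A
run_b = new_run(rng.normal(size=(4, d_model)))  # continuation B reuses the same prefix state
```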
Key Takeaways
Tokens are the basic units of text processing in language models.
Transformer models use self-attention mechanisms to process tokens, relying on queries, keys, and values.
Autoregressive decoding generates text one token at a time, each time conditioned on all previous tokens.
The KV cache stores the keys and values of previous tokens to avoid redundant computations during inference.
Reusing the KV cache for identical initial contexts saves computational resources and speeds up inference.
Background References for Further Reading about Technology up to GPT-2
Vaswani, A., et al. (2017). Attention Is All You Need. Link
Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI GPT-2 Technical Report. Link, PDF
Brown, T., et al. (2020). Language Models are Few-Shot Learners. Link
Glossary
Token: A unit of text, such as a word or subword, used in language processing.
Embedding: A numerical vector representation of a token.
Self-Attention: A mechanism that allows the model to weigh the importance of different tokens in a sequence relative to each other.
Query (Q), Key (K), Value (V): Vectors derived from token embeddings used in the attention mechanism.
Autoregressive Decoding: Generating text one token at a time, each token conditioned on the previous tokens.
KV Cache: A storage of keys and values computed for tokens, used to speed up inference by reusing computations.
Capturing and Reusing Computation State
When dealing with identical initial contexts between different runs or sequences, the KV cache allows for significant computational savings.
Capturing Computation State:
Identical Contexts: If multiple sequences share the same initial tokens, the keys and values for these tokens will be identical.
State Representation: The KV cache at a particular time step effectively represents the computation state up to that point.
Reusing Computation State:
Cache Sharing: By reusing the KV cache for the identical initial context, the model can skip recomputing the keys and values for those tokens.
Implementation Details:
Cache Storage: The KV cache is typically stored as tensors with dimensions reflecting the number of layers, heads, sequence length, and embedding dimensions.
Thanks to everyone, I finally understand it properly! Since causal masking ensures future information isn’t included, the cached keys and values for a prefix stay valid regardless of what follows, so KV caching works for prefixes of arbitrary length and improves computational efficiency!
I think I have understood the gist of the answers thus far.
I’m curious about two things:
1. Whether issuing the prompt in programming-parameter syntax would greatly reduce token burn.
2. Whether using an attachment to upload either just the user input, or the user input plus the prompt, would have any impact on reducing or increasing the tokens required.
I know asking for .docx output definitely increases the tokens burned and does NOT ensure the complete response is provided (e.g., versus seeing it “type” the response out).