Does prompt caching improve efficiency when using the same system prompt across multiple runs?
In use cases like classification or information retrieval, where the same system prompt is applied throughout an entire dataset, I expected prompt caching to help. To test this, I ran an experiment using the IMDb sentiment analysis dataset.
# load imdb dataset from huggingface
from datasets import load_dataset
dataset = load_dataset("imdb")
system_prompt = """You will be provided with a list of reviews and your task is to analyze the sentiment of the review and classify it as Positive or Negative. DO not add any explanation and the sentiment can by only "Positive" or "Negative".
json: {"sentiment" : ["Positive/Negative", "Positive/Negative", "Positive/Negative"]}"""
import pandas as pd
# select the train and test splits
train_dataset = dataset["train"]
test_dataset = dataset["test"]
# convert them to pandas dataframes
train_df = train_dataset.to_pandas()
test_df = test_dataset.to_pandas()
# merge into a single dataframe (reset the index so the label slice below works)
df = pd.concat([train_df, test_df], ignore_index=True)
# take the first ~1000 reviews as plain Python strings so they can be JSON-serialized
text_list = df.loc[:1000, "text"].tolist()
import json
from tqdm import tqdm
from openai import OpenAI

# assumes OPENAI_API_KEY is set in the environment
client = OpenAI()

batch_size = 3
response_list = []
for start in tqdm(range(0, len(text_list), batch_size)):
    end = start + batch_size
    # take the next batch of reviews and send it with the same system prompt
    batch = text_list[start:end]
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": json.dumps(batch),
            },
        ],
        response_format={"type": "json_object"},
    )
    print(chat_completion)
    response_list.append(chat_completion)
On the first run, the cached_tokens field in the response usage is always 0, meaning the cache is never hit. However, when I rerun the same experiment, I do see cache utilization.
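For reference, this is roughly how I'm reading cache utilization from each response (a minimal sketch; it assumes the usage object exposes prompt_tokens_details.cached_tokens, which is what the current OpenAI Python SDK returns for chat completions):

# sketch: inspect prompt token usage and cached tokens for each stored response
for response in response_list:
    usage = response.usage
    details = usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    print(f"prompt_tokens={usage.prompt_tokens}, cached_tokens={cached}")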
From the documentation, I understand that caching only applies to prompts of 1024 tokens or more, and that the beginning of the prompt must match exactly for a request to benefit from the cache. Even though my system prompt is more than 200 tokens, the cache isn't used on a fresh run.
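To sanity-check that, I counted the tokens in the system prompt with tiktoken (a rough sketch, assuming gpt-4o-mini uses the o200k_base encoding):

import tiktoken

# rough token count for the static part of the prompt (the system message alone),
# to compare against the documented 1024-token minimum for caching
enc = tiktoken.get_encoding("o200k_base")
print(len(enc.encode(system_prompt)))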
In multi-turn chatbot scenarios, prompt caching seems clearly useful. However, I'm curious: is prompt caching less suited to cases like classification or entity extraction, where the system prompt stays constant across the dataset but the rest of each request differs? I'd love insights on whether this behavior is expected or whether there are optimizations that can be applied.
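One optimization I'm considering, but haven't verified: making the static prefix itself longer than 1024 tokens, for example by placing fixed few-shot examples right after the system message, so every request shares an identical, cacheable prefix. A rough sketch of what I mean (the example reviews below are placeholders):

# hypothetical sketch: pad the static, identical prefix (system message plus
# fixed few-shot examples) past 1024 tokens so it can be cached across requests
few_shot_examples = [
    {"role": "user", "content": json.dumps(["An absolute masterpiece, I loved every minute."])},
    {"role": "assistant", "content": json.dumps({"sentiment": ["Positive"]})},
    # ...more fixed examples until the shared prefix exceeds 1024 tokens
]

messages = [
    {"role": "system", "content": system_prompt},
    *few_shot_examples,
    {"role": "user", "content": json.dumps(batch)},
]

Is that the kind of optimization that would help here, or is this simply a case that prompt caching isn't designed for?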