Prompt Caching in Classification and Information Retrieval Use Cases

Does prompt caching improve efficiency when using the same system prompt across multiple runs?

In use cases like classification or information retrieval, where the same system prompt is applied throughout an entire dataset, I expected prompt caching to help. To test this, I ran an experiment using the IMDb sentiment analysis dataset.

  # load imdb dataset from huggingface
  from datasets import load_dataset
  dataset = load_dataset("imdb")
  
  system_prompt = """You will be provided with a list of reviews and your task is to analyze the sentiment of the review and classify it as Positive or Negative. DO not add any explanation and the sentiment can by only "Positive" or "Negative".
   json: {"sentiment" : ["Positive/Negative", "Positive/Negative", "Positive/Negative"]}"""
  
  import pandas as pd
  # select only the train and test dataset
  train_dataset = dataset["train"]
  test_dataset = dataset["test"]
  
  # convert it to pandas dataframe
  train_df = train_dataset.to_pandas()
  test_df = test_dataset.to_pandas()
  
  # Merge them into a single dataframe (reset the index so label-based slicing works)
  df = pd.concat([train_df, test_df], ignore_index=True)
  
  # take roughly the first 1000 reviews as a plain Python list
  text_list = df.loc[:1000, "text"].tolist()
  
  import json
  from tqdm import tqdm
  from openai import OpenAI
  
  client = OpenAI()  # expects OPENAI_API_KEY in the environment
  batch_size = 3
  response_list = []
  
  for start in tqdm(range(0, len(text_list), batch_size)):
      end = start + batch_size
      batch = text_list[start:end]
      chat_completion = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {
                  "role": "system",
                  "content": system_prompt,
              },
              {
                  "role": "user",
                  "content": json.dumps(batch),
              },
          ],
          response_format={"type": "json_object"},
      )
      print(chat_completion)
      response_list.append(chat_completion)

On the first run, the cached_tokens parameter in the response is always 0, meaning the cache isn’t utilized. However, when I rerun the experiment, I observe cache utilization.
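For reference, the cached token count can be read from each response's usage object; the field names below are those exposed by the current openai Python SDK and may differ in older versions:

  # where cached_tokens lives in the response (recent openai SDK versions)
  usage = response_list[0].usage
  print(usage.prompt_tokens)                        # total prompt tokens billed
  print(usage.prompt_tokens_details.cached_tokens)  # tokens served from the prompt cache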

From the documentation, I understand that prompts of 1024 tokens or more are eligible for caching, and that the prompt prefix must match for the cache to be used. Despite using a system prompt of more than 200 tokens, I noticed that the cache isn’t used on a fresh run.

In multi-turn chatbot scenarios, prompt caching seems useful. However, I’m curious: is prompt caching less suited for cases like classification or entity extraction, where system prompts remain constant across the dataset? Would love insights on whether this behavior is expected or if there are optimizations that can be applied.


Hi and welcome to the Forum!

One general point: caching only comes into effect for prompts that are at least 1024 tokens long. From your code, it does not look like your (static) system prompt reaches that number of tokens.
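As a quick check, you can count the tokens of just the static part of your prompt, e.g. with tiktoken; the sketch below assumes the o200k_base encoding used by the gpt-4o family:

  # rough token count of the static system prompt only
  import tiktoken

  enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o / gpt-4o-mini
  print(len(enc.encode(system_prompt)))      # caching only applies once the prompt reaches 1024 tokens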

That said, could you clarify your two statements:

Under which circumstances did you observe cache utilization versus not?


I have created a reproducible example -

import json
from tqdm import tqdm

! curl -X GET \
     "https://datasets-server.huggingface.co/rows?dataset=wikimedia%2Fwikipedia&config=20231101.en&split=train&offset=0&length=100" -o data.json

# read data.json
with open("data.json", "r") as f:
  data = json.load(f)

ids = []
text_list = []
for row in data["rows"]:
  id = row["row"]["id"]
  text = row["row"]["text"]
  ids.append(id)
  text_list.append(text)

system_prompt = '''
Your task is to assess the customer's article and categorize it into one of the following predefined categories:
'History', 'Geography', 'Science', 'Technology', 'Mathematics', 'Literature', 'Art', 'Music', 'Film', 'Television', 'Sports', 'Politics', 'Philosophy', 'Religion', 'Sociology', 'Psychology', 'Economics', 'Business', 'Medicine', 'Biology', 'Chemistry', 'Physics', 'Astronomy', 'Environmental Science', 'Engineering', 'Computer Science', 'Linguistics', 'Anthropology', 'Archaeology', 'Education', 'Law', 'Military', 'Architecture', 'Fashion', 'Cuisine', 'Travel', 'Mythology', 'Folklore', 'Biography', 'Social Issues', 'Human Rights', 'Technology Ethics', 'Climate Change', 'Conservation', 'Urban Studies', 'Demographics', 'Journalism', 'Cryptocurrency', 'Artificial Intelligence'
You will output a json object containing the following information:

{
    categories: string[] // category name based on the article,
}

Keep category names simple and use only lower case letters.
Articles can have only one category.
'''
import os
from openai import OpenAI
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your openai key: ")

client = OpenAI()

batch_size = 3
response_list = []

for start in tqdm(range(0, len(text_list), batch_size)):
    end = start + batch_size
    batch = text_list[start:end]
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": json.dumps(batch),
            },
        ],
        response_format={"type": "json_object"},
    )
    print(chat_completion)
    response_list.append(chat_completion)

When I run it for the first time, cached_tokens is always 0 (maybe because the system prompt is just 238 tokens).

However, if I run the code again, I can see that tokens are served from the cache.
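For reference, cache utilization over the whole run can be totalled up like this (attribute names as in the current openai Python SDK):

  # total prompt tokens vs. tokens served from the cache across the run
  total_prompt = sum(r.usage.prompt_tokens for r in response_list)
  total_cached = sum(r.usage.prompt_tokens_details.cached_tokens for r in response_list)
  print(f"cached {total_cached} of {total_prompt} prompt tokens")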

Is it possible to apply any optimization to get the system prompt cached?

I assume the reason the tokens are cached on the second run is that you re-ran the API requests with the same user data?

The system prompt on its own will not get cached when each API request includes different user data, i.e. the data subject to classification. It would only get cached if it were at least 1024 tokens in length, because caching operates on the prompt prefix: with a 238-token system prompt, the first 1024 tokens of every request include part of the varying user data, so no two requests share a cacheable prefix. In other words, only static content at the start of the prompt gets cached (whether placed in the system or the user prompt).
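If you want caching to help in this classification setup, the practical optimization is to make the static prefix itself at least 1024 tokens long, for example by adding detailed category definitions or a set of few-shot examples to the system prompt, and keeping the varying articles at the very end of the request. A minimal sketch, continuing your snippet above (the few-shot padding is purely illustrative placeholder content):

  # sketch: grow the static prefix past 1024 tokens so it becomes cacheable
  # the examples below are placeholders - any static content works, as long as
  # it is identical and placed first in every request
  few_shot_examples = "\n".join(
      f'Example {i}: <a labelled article goes here> -> {{"categories": ["history"]}}'
      for i in range(40)
  )
  long_system_prompt = system_prompt + "\n\nExamples:\n" + few_shot_examples

  for start in tqdm(range(0, len(text_list), batch_size)):
      batch = text_list[start:start + batch_size]
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": long_system_prompt},  # static prefix, identical on every call
              {"role": "user", "content": json.dumps(batch)},     # varying data goes last
          ],
          response_format={"type": "json_object"},
      )
      # after the first request, subsequent requests should report a non-zero
      # cached token count for the shared static prefix
      print(resp.usage.prompt_tokens_details.cached_tokens)

Note that the static prefix has to actually reach 1024 tokens for this to kick in, and cache hits also depend on requests arriving close together in time, since cached prefixes typically expire after a few minutes of inactivity.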