We need to talk about prompt caching

I’ve been seeing lots of complaints about prompt caching for a while.

Many of us already know the prompt caching guide in the docs. The issue is that caching doesn’t always work even when those rules are followed.

But since not everyone shares the prompts or code they are using, it is difficult to replicate the issues. I’ve also noticed things have been very unstable: methods that worked previously don’t always work anymore.

So, I’ve made a small script to set a minimal standard for tests covering 3 situations:

  1. Responses API - the same input prompt, with the developer message passed through roles in the input array, since the instructions parameter can break caching.
  2. Responses API using previous_response_id - similar to the first, but reusing a previous response to guarantee that the prefix remains unchanged.
  3. Chat Completions API - in previous tests this proved to work better, but recently I’ve noticed it has also been failing.

The methodology is to send the same prompt repeatedly for each situation, and measure how long it takes until all 3 situations are successfully cached.
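
For reference, here is a stripped-down sketch of the three request shapes being compared. This is only an illustration, not the measurement script itself: the model choice, the demo-* cache keys and the generated prompt are placeholders, and the full script further below is what actually produced the results.

import random
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1-mini"  # placeholder; any model from the test list works

# A large developer prompt (~1k tokens), since caching needs a prefix of at least 1024 tokens
DEV_PROMPT = ("For every number the user provides, answer in a single word, "
              "how many numbers in the array are lower than it.\n"
              f"Data array: {sorted(random.randint(1, 1000) for _ in range(400))}")

# Situation 1: Responses API, developer prompt passed via input roles (no instructions parameter)
r1 = client.responses.create(
    model=MODEL,
    prompt_cache_key="demo-a",
    input=[{"role": "developer", "content": DEV_PROMPT},
           {"role": "user", "content": "500"}],
)
print("Responses cached tokens:", r1.usage.input_tokens_details.cached_tokens)

# Situation 2: Responses API chained on the previous response, so the prefix is guaranteed identical
r2 = client.responses.create(
    model=MODEL,
    prompt_cache_key="demo-b",
    previous_response_id=r1.id,
    input=[{"role": "user", "content": "600"}],
)
print("Prev ID cached tokens:", r2.usage.input_tokens_details.cached_tokens)

# Situation 3: Chat Completions API with the same developer + user messages
c = client.chat.completions.create(
    model=MODEL,
    prompt_cache_key="demo-c",
    messages=[{"role": "developer", "content": DEV_PROMPT},
              {"role": "user", "content": "500"}],
)
print("Chat Completions cached tokens:", c.usage.prompt_tokens_details.cached_tokens)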

Conclusions
I’ve run a variety of models and noticed that using a previous_response_id seems to be the most effective, but not all prompts can use this method.
The second best was Chat Completions.
The Responses API with no previous_response_id had the worst results.

Here is the Python script:
# Script to test prompt caching effectiveness

import random
import time
from openai import Omit
import pandas as pd
from tabulate import tabulate
from openai import OpenAI

# replace this with your own settings
#client = OpenAI()

# models to be tested
tests = [
    ("gpt-4o-mini",None),
    ("gpt-4.1-mini",None),
    ("gpt-4.1-nano",None),
    ("o4-mini","low"),
    ("gpt-5-nano","minimal"),
    ("gpt-5-nano","low"),
    ("gpt-5-mini","minimal"),
    ("gpt-5-mini","low"),
]
# will add some delayed tries after the first 3
extended_tries = 10
# increase for larger input prompts. 400 numbers ~= 1k tokens (caching needs a prefix of at least 1024 tokens)
array_size = 400

def dev_prompt_generator(size):
    """a randomly generated large prompt for this test"""
    arr = [random.randint(1, 1000) for _ in range(size)]
    arr.sort()
    developer_prompt = f"For every number the user provides, answer in a single word, how many numbers in the array are lower than it.\nData array: {arr}"
    return developer_prompt

def cache_test(model, reasoning, extended_tries=0):
    """Tests a model with 3 different approaches until all of them succeed or the retry limit is reached"""
    developer_prompt = dev_prompt_generator(array_size)
    prompt_cache_key=f"key-{random.randint(1, 10000)-1:04d}"
    start = time.time()
    previous_id = None
    request_count=0
    df = pd.DataFrame(columns=["Sequence", "Time elapsed (secs)", "Cache 1(Responses)","Cache 2(Prev ID)", "Cache 3 (C.Completions)", "Model"])

    def request(msg,number):
        nonlocal request_count, previous_id
        request_count+=1
        
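        # Situation 1: Responses API, developer prompt passed via input roles (no instructions parameter)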
        response= client.responses.with_raw_response.create(
            model=model, 
            prompt_cache_key=f"{prompt_cache_key}-a",
            input = [{"role": "developer","content": developer_prompt},
                    {"role": "user", "content": f"{number}"}],
            reasoning={"effort": reasoning} if reasoning is not None else None,
        ).parse()
        response_cached = response.usage.input_tokens_details.cached_tokens

        if previous_id is None:
            previous_id = response.id

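        # Situation 2: Responses API reusing previous_response_id, so the cached prefix is guaranteed identical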
        response2= client.responses.with_raw_response.create(
            model=model, 
            prompt_cache_key=f"{prompt_cache_key}-b",
            previous_response_id=previous_id,
            input = [{"role": "user", "content": f"{number}"}],
            reasoning={"effort": reasoning} if reasoning is not None else None,
        ).parse()
        response2_cached = response2.usage.input_tokens_details.cached_tokens
        
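        # Situation 3: Chat Completions API with the same developer + user messages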
        completion = client.chat.completions.with_raw_response.parse(
            model=model,
            reasoning_effort=reasoning if reasoning is not None else Omit(),
            messages=[{"role": "developer", "content": developer_prompt},
                {"role": "user", "content": f"{number}"}],
            prompt_cache_key=f"{prompt_cache_key}-c",
        ).parse()
        completion_cached = completion.usage.prompt_tokens_details.cached_tokens
        elapsed = time.time() - start
        print(f" #{request_count} ({model}) - {elapsed:.2f}s - {msg} - ",
              f"- Responses API   : {response_cached}/{response.usage.input_tokens}",
              f"- w/ Previous ID  : {response2_cached}/{response2.usage.input_tokens}",
              f"- Chat Completions: {completion_cached}/{completion.usage.prompt_tokens}")
        df.loc[len(df)] = [request_count, elapsed, response_cached, response2_cached, completion_cached, f"{model} {reasoning if reasoning is not None else ''}"]
        # print(tabulate(df,headers="keys",tablefmt="grid",showindex=False,) )
        
        return response_cached>0 and response2_cached>0 and completion_cached>0, response, response2, completion

    print(f'## Testing model: {model} {reasoning}')
    # print(f"#Developer prompt: {developer_prompt}\n")
    success, response, response2, completion = request("1st request - not expected to cache",random.randint(1, 1000))
    success, response, response2, completion = request("2nd request - too fast, might not cache",random.randint(1, 1000))
    time.sleep(5.0)
    success, response, response2, completion = request("3rd request - after a few seconds",random.randint(1, 1000))
    # how many delayed tries to attempt
    if extended_tries:
        for trial in range(4,4+extended_tries):
            if not success:
                print('...waiting a minute before proceeding...')
                time.sleep(60.0)
                success, response, response2, completion = request(f"{trial}th request - after an extra minute",random.randint(1, 1000))
            else:
                break
    print(f"-- Test finished. Time elapsed: {time.time() - start:.2f}s - {'Caching succeeded' if success else 'Caching failed'}\n\n")
    return df, success

results = []
for test in tests:
    model, reasoning = test
    try:
        df, success = cache_test(model, reasoning, extended_tries=extended_tries)
        results.append((df,success))
    except Exception as e:
        print(f"Error testing {model}: {e}")

dfs = pd.concat([r[0] for r in results], ignore_index=True)

print("# Dataset")
dfs["success"] = dfs[["Cache 1(Responses)","Cache 2(Prev ID)","Cache 3 (C.Completions)"]].gt(0).all(axis=1)
print(tabulate(dfs,headers="keys",tablefmt="grid",showindex=False,) )
latest = dfs.loc[dfs.groupby("Model")["Sequence"].idxmax()]

print("\n\n# Final results")
print(tabulate(latest,headers="keys",tablefmt="grid",showindex=False,) )


Here are the detailed results.
| Sequence | Time elapsed (secs) | Cache 1 (Responses) | Cache 2 (Prev ID) | Cache 3 (C.Completions) | Model | success |
|---|---|---|---|---|---|---|
| 1 | 2.38748 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 2 | 5.64998 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 3 | 13.1289 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 4 | 76.2975 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 5 | 140.881 | 1024 | 0 | 1152 | gpt-4o-mini | False |
| 6 | 202.748 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 7 | 265.758 | 1024 | 0 | 1152 | gpt-4o-mini | False |
| 8 | 328.737 | 0 | 1024 | 1152 | gpt-4o-mini | False |
| 9 | 391.245 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 10 | 454.68 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 11 | 517.908 | 0 | 1024 | 1152 | gpt-4o-mini | False |
| 12 | 580.95 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 13 | 644.254 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 1 | 1.88047 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 2 | 4.20265 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 3 | 11.2398 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 4 | 75.1485 | 1024 | 1024 | 0 | gpt-4.1-mini | False |
| 5 | 138.369 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 6 | 201.734 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 7 | 265.003 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 8 | 328.288 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 9 | 390.318 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 10 | 453.332 | 1024 | 1024 | 1152 | gpt-4.1-mini | True |
| 1 | 2.33952 | 0 | 0 | 0 | gpt-4.1-nano | False |
| 2 | 6.04066 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 3 | 13.7015 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 4 | 78.0247 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 5 | 141.046 | 0 | 1024 | 0 | gpt-4.1-nano | False |
| 6 | 204.062 | 1024 | 1024 | 1024 | gpt-4.1-nano | True |
| 1 | 13.9964 | 0 | 1152 | 0 | o4-mini low | False |
| 2 | 71.1219 | 1152 | 1152 | 1152 | o4-mini low | True |
| 3 | 90.2334 | 0 | 1152 | 1152 | o4-mini low | False |
| 4 | 195.242 | 0 | 1152 | 1152 | o4-mini low | False |
| 5 | 268.314 | 0 | 1152 | 1152 | o4-mini low | False |
| 6 | 370.723 | 0 | 1152 | 1152 | o4-mini low | False |
| 7 | 444.106 | 1152 | 1152 | 1152 | o4-mini low | True |
| 1 | 3.16477 | 0 | 0 | 0 | gpt-5-nano minimal | False |
| 2 | 8.68408 | 0 | 0 | 1152 | gpt-5-nano minimal | False |
| 3 | 17.0276 | 1152 | 1152 | 1152 | gpt-5-nano minimal | True |
| 1 | 20.8009 | 0 | 0 | 0 | gpt-5-nano low | False |
| 2 | 44.8654 | 0 | 0 | 0 | gpt-5-nano low | False |
| 3 | 79.3201 | 0 | 0 | 0 | gpt-5-nano low | False |
| 4 | 179.357 | 0 | 0 | 0 | gpt-5-nano low | False |
| 5 | 265.135 | 0 | 0 | 0 | gpt-5-nano low | False |
| 6 | 344.083 | 0 | 0 | 0 | gpt-5-nano low | False |
| 7 | 423.603 | 0 | 0 | 0 | gpt-5-nano low | False |
| 8 | 514.556 | 0 | 0 | 0 | gpt-5-nano low | False |
| 9 | 607.148 | 0 | 0 | 0 | gpt-5-nano low | False |
| 10 | 705.163 | 0 | 0 | 0 | gpt-5-nano low | False |
| 11 | 800.331 | 0 | 0 | 0 | gpt-5-nano low | False |
| 12 | 880.194 | 0 | 0 | 0 | gpt-5-nano low | False |
| 13 | 993.887 | 0 | 0 | 0 | gpt-5-nano low | False |
| 1 | 4.42109 | 0 | 1152 | 0 | gpt-5-mini minimal | False |
| 2 | 8.99814 | 0 | 1152 | 0 | gpt-5-mini minimal | False |
| 3 | 20.7642 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 4 | 85.043 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 5 | 150.378 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 6 | 214.696 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 7 | 279.243 | 1152 | 1152 | 1152 | gpt-5-mini minimal | True |
| 1 | 30.7662 | 0 | 0 | 0 | gpt-5-mini low | False |
| 2 | 59.0795 | 0 | 1152 | 0 | gpt-5-mini low | False |
| 3 | 105.617 | 0 | 0 | 0 | gpt-5-mini low | False |
| 4 | 194.295 | 0 | 0 | 0 | gpt-5-mini low | False |
| 5 | 330.704 | 0 | 0 | 0 | gpt-5-mini low | False |
| 6 | 417.432 | 0 | 0 | 0 | gpt-5-mini low | False |
| 7 | 532.9 | 0 | 0 | 0 | gpt-5-mini low | False |
| 8 | 630.575 | 0 | 0 | 0 | gpt-5-mini low | False |
| 9 | 732.305 | 0 | 0 | 0 | gpt-5-mini low | False |
| 10 | 820.879 | 0 | 0 | 0 | gpt-5-mini low | False |
| 11 | 907.207 | 0 | 0 | 0 | gpt-5-mini low | False |
| 12 | 999.845 | 0 | 0 | 0 | gpt-5-mini low | False |
| 13 | 1105.79 | 0 | 0 | 0 | gpt-5-mini low | False |
And here is the filtered report, showing the final attempt for each model: the first attempt where all 3 methods were cached, or the last attempt if caching never fully succeeded.

In this final report you can see how many attempts it took (Sequence column) to reach successful caching with all 3 methods, how long that took, or whether caching still failed after roughly 13 attempts.

| Sequence | Time elapsed (secs) | Cache 1 (Responses) | Cache 2 (Prev ID) | Cache 3 (C.Completions) | Model | success |
|---|---|---|---|---|---|---|
| 10 | 453.332 | 1024 | 1024 | 1152 | gpt-4.1-mini | True |
| 6 | 204.062 | 1024 | 1024 | 1024 | gpt-4.1-nano | True |
| 13 | 644.254 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 13 | 1105.79 | 0 | 0 | 0 | gpt-5-mini low | False |
| 7 | 279.243 | 1152 | 1152 | 1152 | gpt-5-mini minimal | True |
| 13 | 993.887 | 0 | 0 | 0 | gpt-5-nano low | False |
| 3 | 17.0276 | 1152 | 1152 | 1152 | gpt-5-nano minimal | True |
| 7 | 444.106 | 1152 | 1152 | 1152 | o4-mini low | True |

@OpenAI_Support
I hope this can be used as a reference to monitor prompt caching deterioration, and it would be great if someone from OpenAI staff could look further into it.


Did you mean to leave “client =” commented-out in the code?

I have something similar, with a “warm-up” call followed by a delay to make sure everything needed to establish a server-side cache has been done, and then cycling through all the models sequentially and slowly. My code base has lots of statistics and report-collection code intermingled, from tweaking a multipurpose base (a “non-cached varying speed test”) into a “cache test”. And no “import openai”.

Running identical input at twice the length needed, and getting no cache discount delivered on gpt-5-mini and gpt-5-nano on Chat Completions, ever.
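
For reference, a stripped-down sketch of that kind of warm-up-then-retry check without the SDK might look like this (not the actual harness described above; it assumes the requests library, a direct call to the Chat Completions endpoint, and a placeholder prompt and delay).

import os
import time
import requests

URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
# Placeholder long prompt; it must exceed the 1024-token minimum for caching to apply
BIG_PROMPT = "Answer with a single word. Data: " + " ".join(str(i) for i in range(1500))

def cached_tokens(model: str) -> tuple[int, int]:
    """Send one request and report (cached prompt tokens, total prompt tokens)."""
    body = {
        "model": model,
        "messages": [{"role": "developer", "content": BIG_PROMPT},
                     {"role": "user", "content": "42"}],
    }
    usage = requests.post(URL, headers=HEADERS, json=body, timeout=120).json()["usage"]
    return usage["prompt_tokens_details"]["cached_tokens"], usage["prompt_tokens"]

for model in ["gpt-5-mini", "gpt-5-nano"]:
    cached_tokens(model)                  # warm-up call to seed the server-side cache
    time.sleep(30)                        # delay so the cache has time to be established
    cached, total = cached_tokens(model)  # identical follow-up; cached should be > 0
    print(f"{model}: {cached}/{total} prompt tokens served from cache")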
