We need to talk about prompt caching

I’ve been seeing lots of complaints about prompt caching for a while.

Many of us already know the prompt caching guide in the docs. The issue is that caching doesn’t always work even when those rules are followed.

But since not everyone shares the prompts or code they are using, it is difficult to replicate the issues. I’ve also noticed things have been very unstable: methods that worked previously don’t always work anymore.

So, I’ve made a small script to set a minimal standard for tests covering 3 situations:

  1. Responses API - the same input prompt, with the developer message passed through roles in the input array, since the instructions parameter can break caching.
  2. Responses API using previous_response_id - similar to the first, but reusing a previous response to guarantee that the prefix remains unchanged.
  3. Chat Completions API - in previous tests this proved to work better, but recently I’ve noticed it has also been failing.

The methodology is to send the same prompt repeatedly for each situation, and measure how long it takes until all 3 situations are successfully cached.
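
For reference, here is a stripped-down sketch of the three request shapes being compared. This is only an illustration, not the measurement script itself: the model choice, the demo-* cache keys and the generated prompt are placeholders, and the full script further below is what actually produced the results.

import random
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1-mini"  # placeholder; any model from the test list works

# A large developer prompt (~1k tokens), since caching needs a prefix of at least 1024 tokens
DEV_PROMPT = ("For every number the user provides, answer in a single word, "
              "how many numbers in the array are lower than it.\n"
              f"Data array: {sorted(random.randint(1, 1000) for _ in range(400))}")

# Situation 1: Responses API, developer prompt passed via input roles (no instructions parameter)
r1 = client.responses.create(
    model=MODEL,
    prompt_cache_key="demo-a",
    input=[{"role": "developer", "content": DEV_PROMPT},
           {"role": "user", "content": "500"}],
)
print("Responses cached tokens:", r1.usage.input_tokens_details.cached_tokens)

# Situation 2: Responses API chained on the previous response, so the prefix is guaranteed identical
r2 = client.responses.create(
    model=MODEL,
    prompt_cache_key="demo-b",
    previous_response_id=r1.id,
    input=[{"role": "user", "content": "600"}],
)
print("Prev ID cached tokens:", r2.usage.input_tokens_details.cached_tokens)

# Situation 3: Chat Completions API with the same developer + user messages
c = client.chat.completions.create(
    model=MODEL,
    prompt_cache_key="demo-c",
    messages=[{"role": "developer", "content": DEV_PROMPT},
              {"role": "user", "content": "500"}],
)
print("Chat Completions cached tokens:", c.usage.prompt_tokens_details.cached_tokens)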

Conclusions
I’ve run a variety of models and noticed that using a previous_response_id seems to be the most effective, but not all prompts can use this method.
The second best was Chat Completions.
The Responses API with no previous_response_id had the worst results.

Here is the Python script:
# Script to test prompt caching effectiveness

import random
import time
from openai import Omit
import pandas as pd
from tabulate import tabulate
from openai import OpenAI

# replace this with your own settings
#client = OpenAI()

# models to be tested
tests = [
    ("gpt-4o-mini",None),
    ("gpt-4.1-mini",None),
    ("gpt-4.1-nano",None),
    ("o4-mini","low"),
    ("gpt-5-nano","minimal"),
    ("gpt-5-nano","low"),
    ("gpt-5-mini","minimal"),
    ("gpt-5-mini","low"),
]
# will add some delayed tries after the first 3
extended_tries = 10
# increase for larger input prompts. 400 numbers ~= 1k tokens (caching needs a prefix of at least 1024 tokens)
array_size = 400

def dev_prompt_generator(size):
    """a randomly generated large prompt for this test"""
    arr = [random.randint(1, 1000) for _ in range(size)]
    arr.sort()
    developer_prompt = f"For every number the user provides, answer in a single word, how many numbers in the array are lower than it.\nData array: {arr}"
    return developer_prompt

def cache_test(model, reasoning, extended_tries=0):
    """Tests a model with 3 different approaches until all of them succeed or the retry limit is reached"""
    developer_prompt = dev_prompt_generator(array_size)
    prompt_cache_key=f"key-{random.randint(1, 10000)-1:04d}"
    start = time.time()
    previous_id = None
    request_count=0
    df = pd.DataFrame(columns=["Sequence", "Time elapsed (secs)", "Cache 1(Responses)","Cache 2(Prev ID)", "Cache 3 (C.Completions)", "Model"])

    def request(msg,number):
        nonlocal request_count, previous_id
        request_count+=1
        
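        # Situation 1: Responses API, developer prompt passed via input roles (no instructions parameter)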
        response= client.responses.with_raw_response.create(
            model=model, 
            prompt_cache_key=f"{prompt_cache_key}-a",
            input = [{"role": "developer","content": developer_prompt},
                    {"role": "user", "content": f"{number}"}],
            reasoning={"effort": reasoning} if reasoning is not None else None,
        ).parse()
        response_cached = response.usage.input_tokens_details.cached_tokens

        if previous_id is None:
            previous_id = response.id

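        # Situation 2: Responses API reusing previous_response_id, so the cached prefix is guaranteed identical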
        response2= client.responses.with_raw_response.create(
            model=model, 
            prompt_cache_key=f"{prompt_cache_key}-b",
            previous_response_id=previous_id,
            input = [{"role": "user", "content": f"{number}"}],
            reasoning={"effort": reasoning} if reasoning is not None else None,
        ).parse()
        response2_cached = response2.usage.input_tokens_details.cached_tokens
        
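        # Situation 3: Chat Completions API with the same developer + user messages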
        completion = client.chat.completions.with_raw_response.parse(
            model=model,
            reasoning_effort=reasoning if reasoning is not None else Omit(),
            messages=[{"role": "developer", "content": developer_prompt},
                {"role": "user", "content": f"{number}"}],
            prompt_cache_key=f"{prompt_cache_key}-c",
        ).parse()
        completion_cached = completion.usage.prompt_tokens_details.cached_tokens
        elapsed = time.time() - start
        print(f" #{request_count} ({model}) - {elapsed:.2f}s - {msg} - ",
              f"- Responses API   : {response_cached}/{response.usage.input_tokens}",
              f"- w/ Previous ID  : {response2_cached}/{response2.usage.input_tokens}",
              f"- Chat Completions: {completion_cached}/{completion.usage.prompt_tokens}")
        df.loc[len(df)] = [request_count, elapsed, response_cached, response2_cached, completion_cached, f"{model} {reasoning if reasoning is not None else ''}"]
        # print(tabulate(df,headers="keys",tablefmt="grid",showindex=False,) )
        
        return response_cached>0 and response2_cached>0 and completion_cached>0, response, response2, completion

    print(f'## Testing model: {model} {reasoning}')
    # print(f"#Developer prompt: {developer_prompt}\n")
    success, response, response2, completion = request("1st request - not expected to cache",random.randint(1, 1000))
    success, response, response2, completion = request("2nd request - too fast, might not cache",random.randint(1, 1000))
    time.sleep(5.0)
    success, response, response2, completion = request("3rd request - after a few seconds",random.randint(1, 1000))
    # how many delayed tries to attempt
    if extended_tries:
        for trial in range(4,4+extended_tries):
            if not success:
                print('...waiting a minute before proceeding...')
                time.sleep(60.0)
                success, response, response2, completion = request(f"{trial}th request - after an extra minute",random.randint(1, 1000))
            else:
                break
    print(f"-- Test finished. Time elapsed: {time.time() - start:.2f}s - {'Caching succeeded' if success else 'Caching failed'}\n\n")
    return df, success

results = []
for test in tests:
    model, reasoning = test
    try:
        df, success = cache_test(model, reasoning, extended_tries=extended_tries)
        results.append((df,success))
    except Exception as e:
        print(f"Error testing {model}: {e}")

dfs = pd.concat([r[0] for r in results], ignore_index=True)

print("# Dataset")
dfs["success"] = dfs[["Cache 1(Responses)","Cache 2(Prev ID)","Cache 3 (C.Completions)"]].gt(0).all(axis=1)
print(tabulate(dfs,headers="keys",tablefmt="grid",showindex=False,) )
latest = dfs.loc[dfs.groupby("Model")["Sequence"].idxmax()]

print("\n\n# Final results")
print(tabulate(latest,headers="keys",tablefmt="grid",showindex=False,) )


Here are the detailed results.
| Sequence | Time elapsed (secs) | Cache 1 (Responses) | Cache 2 (Prev ID) | Cache 3 (C.Completions) | Model | success |
|---|---|---|---|---|---|---|
| 1 | 2.38748 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 2 | 5.64998 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 3 | 13.1289 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 4 | 76.2975 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 5 | 140.881 | 1024 | 0 | 1152 | gpt-4o-mini | False |
| 6 | 202.748 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 7 | 265.758 | 1024 | 0 | 1152 | gpt-4o-mini | False |
| 8 | 328.737 | 0 | 1024 | 1152 | gpt-4o-mini | False |
| 9 | 391.245 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 10 | 454.68 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 11 | 517.908 | 0 | 1024 | 1152 | gpt-4o-mini | False |
| 12 | 580.95 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 13 | 644.254 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 1 | 1.88047 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 2 | 4.20265 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 3 | 11.2398 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 4 | 75.1485 | 1024 | 1024 | 0 | gpt-4.1-mini | False |
| 5 | 138.369 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 6 | 201.734 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 7 | 265.003 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 8 | 328.288 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 9 | 390.318 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 10 | 453.332 | 1024 | 1024 | 1152 | gpt-4.1-mini | True |
| 1 | 2.33952 | 0 | 0 | 0 | gpt-4.1-nano | False |
| 2 | 6.04066 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 3 | 13.7015 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 4 | 78.0247 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 5 | 141.046 | 0 | 1024 | 0 | gpt-4.1-nano | False |
| 6 | 204.062 | 1024 | 1024 | 1024 | gpt-4.1-nano | True |
| 1 | 13.9964 | 0 | 1152 | 0 | o4-mini low | False |
| 2 | 71.1219 | 1152 | 1152 | 1152 | o4-mini low | True |
| 3 | 90.2334 | 0 | 1152 | 1152 | o4-mini low | False |
| 4 | 195.242 | 0 | 1152 | 1152 | o4-mini low | False |
| 5 | 268.314 | 0 | 1152 | 1152 | o4-mini low | False |
| 6 | 370.723 | 0 | 1152 | 1152 | o4-mini low | False |
| 7 | 444.106 | 1152 | 1152 | 1152 | o4-mini low | True |
| 1 | 3.16477 | 0 | 0 | 0 | gpt-5-nano minimal | False |
| 2 | 8.68408 | 0 | 0 | 1152 | gpt-5-nano minimal | False |
| 3 | 17.0276 | 1152 | 1152 | 1152 | gpt-5-nano minimal | True |
| 1 | 20.8009 | 0 | 0 | 0 | gpt-5-nano low | False |
| 2 | 44.8654 | 0 | 0 | 0 | gpt-5-nano low | False |
| 3 | 79.3201 | 0 | 0 | 0 | gpt-5-nano low | False |
| 4 | 179.357 | 0 | 0 | 0 | gpt-5-nano low | False |
| 5 | 265.135 | 0 | 0 | 0 | gpt-5-nano low | False |
| 6 | 344.083 | 0 | 0 | 0 | gpt-5-nano low | False |
| 7 | 423.603 | 0 | 0 | 0 | gpt-5-nano low | False |
| 8 | 514.556 | 0 | 0 | 0 | gpt-5-nano low | False |
| 9 | 607.148 | 0 | 0 | 0 | gpt-5-nano low | False |
| 10 | 705.163 | 0 | 0 | 0 | gpt-5-nano low | False |
| 11 | 800.331 | 0 | 0 | 0 | gpt-5-nano low | False |
| 12 | 880.194 | 0 | 0 | 0 | gpt-5-nano low | False |
| 13 | 993.887 | 0 | 0 | 0 | gpt-5-nano low | False |
| 1 | 4.42109 | 0 | 1152 | 0 | gpt-5-mini minimal | False |
| 2 | 8.99814 | 0 | 1152 | 0 | gpt-5-mini minimal | False |
| 3 | 20.7642 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 4 | 85.043 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 5 | 150.378 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 6 | 214.696 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 7 | 279.243 | 1152 | 1152 | 1152 | gpt-5-mini minimal | True |
| 1 | 30.7662 | 0 | 0 | 0 | gpt-5-mini low | False |
| 2 | 59.0795 | 0 | 1152 | 0 | gpt-5-mini low | False |
| 3 | 105.617 | 0 | 0 | 0 | gpt-5-mini low | False |
| 4 | 194.295 | 0 | 0 | 0 | gpt-5-mini low | False |
| 5 | 330.704 | 0 | 0 | 0 | gpt-5-mini low | False |
| 6 | 417.432 | 0 | 0 | 0 | gpt-5-mini low | False |
| 7 | 532.9 | 0 | 0 | 0 | gpt-5-mini low | False |
| 8 | 630.575 | 0 | 0 | 0 | gpt-5-mini low | False |
| 9 | 732.305 | 0 | 0 | 0 | gpt-5-mini low | False |
| 10 | 820.879 | 0 | 0 | 0 | gpt-5-mini low | False |
| 11 | 907.207 | 0 | 0 | 0 | gpt-5-mini low | False |
| 12 | 999.845 | 0 | 0 | 0 | gpt-5-mini low | False |
| 13 | 1105.79 | 0 | 0 | 0 | gpt-5-mini low | False |
And here is the filtered report, showing the final attempt for each model: the first attempt where all 3 methods were cached, or the last attempt if caching never fully succeeded.

In this final report you can see how many attempts it took (Sequence column) to reach successful caching with all 3 methods, how long that took, or whether caching still failed after roughly 13 attempts.

| Sequence | Time elapsed (secs) | Cache 1 (Responses) | Cache 2 (Prev ID) | Cache 3 (C.Completions) | Model | success |
|---|---|---|---|---|---|---|
| 10 | 453.332 | 1024 | 1024 | 1152 | gpt-4.1-mini | True |
| 6 | 204.062 | 1024 | 1024 | 1024 | gpt-4.1-nano | True |
| 13 | 644.254 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 13 | 1105.79 | 0 | 0 | 0 | gpt-5-mini low | False |
| 7 | 279.243 | 1152 | 1152 | 1152 | gpt-5-mini minimal | True |
| 13 | 993.887 | 0 | 0 | 0 | gpt-5-nano low | False |
| 3 | 17.0276 | 1152 | 1152 | 1152 | gpt-5-nano minimal | True |
| 7 | 444.106 | 1152 | 1152 | 1152 | o4-mini low | True |

@OpenAI_Support
I hope this can be used as a reference to monitor prompt caching deterioration, and it would be great if someone from OpenAI staff could look further into it.


Did you mean to leave “client =” commented-out in the code?

I have something similar, with a “warm-up” call followed by a delay to make sure everything needed to establish a server-side cache has been done, and then cycling through all the models sequentially and slowly. My code base has lots of statistics and report-collection code intermingled, from tweaking a multipurpose base (a “non-cached varying speed test”) into a “cache test”. And no “import openai”.

Running identical input at twice the length needed, and getting no cache discount delivered on gpt-5-mini and gpt-5-nano on Chat Completions, ever.
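
For reference, a stripped-down sketch of that kind of warm-up-then-retry check without the SDK might look like this (not the actual harness described above; it assumes the requests library, a direct call to the Chat Completions endpoint, and a placeholder prompt and delay).

import os
import time
import requests

URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
# Placeholder long prompt; it must exceed the 1024-token minimum for caching to apply
BIG_PROMPT = "Answer with a single word. Data: " + " ".join(str(i) for i in range(1500))

def cached_tokens(model: str) -> tuple[int, int]:
    """Send one request and report (cached prompt tokens, total prompt tokens)."""
    body = {
        "model": model,
        "messages": [{"role": "developer", "content": BIG_PROMPT},
                     {"role": "user", "content": "42"}],
    }
    usage = requests.post(URL, headers=HEADERS, json=body, timeout=120).json()["usage"]
    return usage["prompt_tokens_details"]["cached_tokens"], usage["prompt_tokens"]

for model in ["gpt-5-mini", "gpt-5-nano"]:
    cached_tokens(model)                  # warm-up call to seed the server-side cache
    time.sleep(30)                        # delay so the cache has time to be established
    cached, total = cached_tokens(model)  # identical follow-up; cached should be > 0
    print(f"{model}: {cached}/{total} prompt tokens served from cache")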
