I’ve been seeing a lot of complaints about prompt caching for a while now.
Many of us already know the prompt caching guide in the docs. The issue is that caching doesn’t always work even when those rules are followed.
But since not everyone shares the prompts or code they are using, it is difficult to reproduce the issues. I’ve also noticed things have been very unstable: methods that worked previously don’t always work anymore.
So, I’ve made a small script to set a minimal standard for tests covering 3 situations:

- Responses API - same input prompt using roles as inputs, since the `instructions` parameter can break caching.
- Responses API using `previous_response_id` - similar to the first, but reusing a previous response to guarantee that the prefix data remains unchanged.
- Chat Completions API - in previous tests it proved to work better, but recently I’ve noticed this has also been failing.
The methodology is to send the same prompt repeatedly for each situation, and measure how long it takes until all 3 situations are successfully cached.
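For reference, here is a minimal sketch of what the three request shapes look like (model name and prompt are placeholders; the full measurement script is further down):

```python
from openai import OpenAI

client = OpenAI()
# The large, stable prefix you want cached (caching only applies from ~1024 prefix tokens).
developer_prompt = "...your long developer prompt..."

# 1) Responses API - developer/user roles passed via `input` (no `instructions` parameter)
r1 = client.responses.create(
    model="gpt-4.1-mini",
    prompt_cache_key="demo-a",
    input=[{"role": "developer", "content": developer_prompt},
           {"role": "user", "content": "42"}],
)

# 2) Responses API - chaining with previous_response_id so the prefix stays identical
r2 = client.responses.create(
    model="gpt-4.1-mini",
    prompt_cache_key="demo-b",
    previous_response_id=r1.id,
    input=[{"role": "user", "content": "43"}],
)

# 3) Chat Completions API - same prefix as a developer message
c = client.chat.completions.create(
    model="gpt-4.1-mini",
    prompt_cache_key="demo-c",
    messages=[{"role": "developer", "content": developer_prompt},
              {"role": "user", "content": "42"}],
)

# Cached-token counts come back in the usage details of each response.
print(r1.usage.input_tokens_details.cached_tokens)
print(r2.usage.input_tokens_details.cached_tokens)
print(c.usage.prompt_tokens_details.cached_tokens)
```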
Conclusions
I’ve run the test across a variety of models, and using `previous_response_id` seems to be the most effective approach, though not every prompt or workflow can use this method.
Chat Completions came second.
The Responses API with no `previous_response_id` had the worst results.
Here is the Python script:
```python
# Script to test prompt caching effectiveness
import random
import time

import pandas as pd
from tabulate import tabulate
from openai import OpenAI, Omit

# replace this with your own settings
client = OpenAI()

# models to be tested: (model, reasoning effort)
tests = [
    ("gpt-4o-mini", None),
    ("gpt-4.1-mini", None),
    ("gpt-4.1-nano", None),
    ("o4-mini", "low"),
    ("gpt-5-nano", "minimal"),
    ("gpt-5-nano", "low"),
    ("gpt-5-mini", "minimal"),
    ("gpt-5-mini", "low"),
]

# will add some delayed tries after the first 3
extended_tries = 10

# increase for larger input prompts. 400 ~= 1k tokens (caching needs a 1024+ token prefix)
array_size = 400


def dev_prompt_generator(size):
    """A randomly generated large prompt for this test."""
    arr = [random.randint(1, 1000) for _ in range(size)]
    arr.sort()
    developer_prompt = f"For every number the user provides, answer in a single word, how many numbers in the array are lower than it.\nData array: {arr}"
    return developer_prompt


def cache_test(model, reasoning, extended_tries=0):
    """Tests a model with 3 different approaches until all of them succeed or a limit is reached."""
    developer_prompt = dev_prompt_generator(array_size)
    prompt_cache_key = f"key-{random.randint(1, 10000)-1:04d}"
    start = time.time()
    previous_id = None
    request_count = 0
    df = pd.DataFrame(columns=["Sequence", "Time elapsed (secs)", "Cache 1 (Responses)",
                               "Cache 2 (Prev ID)", "Cache 3 (C.Completions)", "Model"])

    def request(msg, number):
        nonlocal request_count, previous_id
        request_count += 1
        # 1) Responses API - developer/user roles as input
        response = client.responses.with_raw_response.create(
            model=model,
            prompt_cache_key=f"{prompt_cache_key}-a",
            input=[{"role": "developer", "content": developer_prompt},
                   {"role": "user", "content": f"{number}"}],
            reasoning={"effort": reasoning} if reasoning is not None else None,
        ).parse()
        response_cached = response.usage.input_tokens_details.cached_tokens
        if previous_id is None:
            previous_id = response.id
        # 2) Responses API - reusing previous_response_id
        response2 = client.responses.with_raw_response.create(
            model=model,
            prompt_cache_key=f"{prompt_cache_key}-b",
            previous_response_id=previous_id,
            input=[{"role": "user", "content": f"{number}"}],
            reasoning={"effort": reasoning} if reasoning is not None else None,
        ).parse()
        response2_cached = response2.usage.input_tokens_details.cached_tokens
        # 3) Chat Completions API
        completion = client.chat.completions.with_raw_response.parse(
            model=model,
            reasoning_effort=reasoning if reasoning is not None else Omit(),
            messages=[{"role": "developer", "content": developer_prompt},
                      {"role": "user", "content": f"{number}"}],
            prompt_cache_key=f"{prompt_cache_key}-c",
        ).parse()
        completion_cached = completion.usage.prompt_tokens_details.cached_tokens

        elapsed = time.time() - start
        print(f" #{request_count} ({model}) - {elapsed:.2f}s - {msg} - ",
              f"- Responses API   : {response_cached}/{response.usage.input_tokens}",
              f"- w/ Previous ID  : {response2_cached}/{response2.usage.input_tokens}",
              f"- Chat Completions: {completion_cached}/{completion.usage.prompt_tokens}")
        df.loc[len(df)] = [request_count, elapsed, response_cached, response2_cached, completion_cached,
                           f"{model} {reasoning if reasoning is not None else ''}"]
        # print(tabulate(df, headers="keys", tablefmt="grid", showindex=False))
        return (response_cached > 0 and response2_cached > 0 and completion_cached > 0,
                response, response2, completion)

    print(f"## Testing model: {model} {reasoning}")
    # print(f"# Developer prompt: {developer_prompt}\n")
    success, response, response2, completion = request("1st request - not expected to cache", random.randint(1, 1000))
    success, response, response2, completion = request("2nd request - too fast, might not cache", random.randint(1, 1000))
    time.sleep(5.0)
    success, response, response2, completion = request("3rd request - after a few seconds", random.randint(1, 1000))
    # how many delayed tries to attempt
    if extended_tries:
        for trial in range(4, 4 + extended_tries):
            if not success:
                print("...waiting a minute before proceeding...")
                time.sleep(60.0)
                success, response, response2, completion = request(f"{trial}th request - after an extra minute", random.randint(1, 1000))
            else:
                break
    result_msg = "Caching succeeded" if success else "Caching failed"
    print(f"-- Test finished. Time elapsed: {time.time() - start} - {result_msg}\n\n")
    return df, success


results = []
for test in tests:
    model, reasoning = test
    try:
        df, success = cache_test(model, reasoning, extended_tries=extended_tries)
        results.append((df, success))
    except Exception as e:
        print("Error: ", e)

dfs = pd.concat([r[0] for r in results], ignore_index=True)
print("# Dataset")
dfs["success"] = dfs[["Cache 1 (Responses)", "Cache 2 (Prev ID)", "Cache 3 (C.Completions)"]].gt(0).all(axis=1)
print(tabulate(dfs, headers="keys", tablefmt="grid", showindex=False))

latest = dfs.loc[dfs.groupby("Model")["Sequence"].idxmax()]
print("\n\n# Final results")
print(tabulate(latest, headers="keys", tablefmt="grid", showindex=False))
```
Here are the detailed results.
| Sequence | Time elapsed (secs) | Cache 1 (Responses) | Cache 2 (Prev ID) | Cache 3 (C.Completions) | Model | success |
|---|---|---|---|---|---|---|
| 1 | 2.38748 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 2 | 5.64998 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 3 | 13.1289 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 4 | 76.2975 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 5 | 140.881 | 1024 | 0 | 1152 | gpt-4o-mini | False |
| 6 | 202.748 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 7 | 265.758 | 1024 | 0 | 1152 | gpt-4o-mini | False |
| 8 | 328.737 | 0 | 1024 | 1152 | gpt-4o-mini | False |
| 9 | 391.245 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 10 | 454.68 | 0 | 0 | 1152 | gpt-4o-mini | False |
| 11 | 517.908 | 0 | 1024 | 1152 | gpt-4o-mini | False |
| 12 | 580.95 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 13 | 644.254 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 1 | 1.88047 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 2 | 4.20265 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 3 | 11.2398 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 4 | 75.1485 | 1024 | 1024 | 0 | gpt-4.1-mini | False |
| 5 | 138.369 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 6 | 201.734 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 7 | 265.003 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 8 | 328.288 | 0 | 1024 | 1152 | gpt-4.1-mini | False |
| 9 | 390.318 | 0 | 1024 | 0 | gpt-4.1-mini | False |
| 10 | 453.332 | 1024 | 1024 | 1152 | gpt-4.1-mini | True |
| 1 | 2.33952 | 0 | 0 | 0 | gpt-4.1-nano | False |
| 2 | 6.04066 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 3 | 13.7015 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 4 | 78.0247 | 0 | 1024 | 1024 | gpt-4.1-nano | False |
| 5 | 141.046 | 0 | 1024 | 0 | gpt-4.1-nano | False |
| 6 | 204.062 | 1024 | 1024 | 1024 | gpt-4.1-nano | True |
| 1 | 13.9964 | 0 | 1152 | 0 | o4-mini low | False |
| 2 | 71.1219 | 1152 | 1152 | 1152 | o4-mini low | True |
| 3 | 90.2334 | 0 | 1152 | 1152 | o4-mini low | False |
| 4 | 195.242 | 0 | 1152 | 1152 | o4-mini low | False |
| 5 | 268.314 | 0 | 1152 | 1152 | o4-mini low | False |
| 6 | 370.723 | 0 | 1152 | 1152 | o4-mini low | False |
| 7 | 444.106 | 1152 | 1152 | 1152 | o4-mini low | True |
| 1 | 3.16477 | 0 | 0 | 0 | gpt-5-nano minimal | False |
| 2 | 8.68408 | 0 | 0 | 1152 | gpt-5-nano minimal | False |
| 3 | 17.0276 | 1152 | 1152 | 1152 | gpt-5-nano minimal | True |
| 1 | 20.8009 | 0 | 0 | 0 | gpt-5-nano low | False |
| 2 | 44.8654 | 0 | 0 | 0 | gpt-5-nano low | False |
| 3 | 79.3201 | 0 | 0 | 0 | gpt-5-nano low | False |
| 4 | 179.357 | 0 | 0 | 0 | gpt-5-nano low | False |
| 5 | 265.135 | 0 | 0 | 0 | gpt-5-nano low | False |
| 6 | 344.083 | 0 | 0 | 0 | gpt-5-nano low | False |
| 7 | 423.603 | 0 | 0 | 0 | gpt-5-nano low | False |
| 8 | 514.556 | 0 | 0 | 0 | gpt-5-nano low | False |
| 9 | 607.148 | 0 | 0 | 0 | gpt-5-nano low | False |
| 10 | 705.163 | 0 | 0 | 0 | gpt-5-nano low | False |
| 11 | 800.331 | 0 | 0 | 0 | gpt-5-nano low | False |
| 12 | 880.194 | 0 | 0 | 0 | gpt-5-nano low | False |
| 13 | 993.887 | 0 | 0 | 0 | gpt-5-nano low | False |
| 1 | 4.42109 | 0 | 1152 | 0 | gpt-5-mini minimal | False |
| 2 | 8.99814 | 0 | 1152 | 0 | gpt-5-mini minimal | False |
| 3 | 20.7642 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 4 | 85.043 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 5 | 150.378 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 6 | 214.696 | 0 | 1152 | 1152 | gpt-5-mini minimal | False |
| 7 | 279.243 | 1152 | 1152 | 1152 | gpt-5-mini minimal | True |
| 1 | 30.7662 | 0 | 0 | 0 | gpt-5-mini low | False |
| 2 | 59.0795 | 0 | 1152 | 0 | gpt-5-mini low | False |
| 3 | 105.617 | 0 | 0 | 0 | gpt-5-mini low | False |
| 4 | 194.295 | 0 | 0 | 0 | gpt-5-mini low | False |
| 5 | 330.704 | 0 | 0 | 0 | gpt-5-mini low | False |
| 6 | 417.432 | 0 | 0 | 0 | gpt-5-mini low | False |
| 7 | 532.9 | 0 | 0 | 0 | gpt-5-mini low | False |
| 8 | 630.575 | 0 | 0 | 0 | gpt-5-mini low | False |
| 9 | 732.305 | 0 | 0 | 0 | gpt-5-mini low | False |
| 10 | 820.879 | 0 | 0 | 0 | gpt-5-mini low | False |
| 11 | 907.207 | 0 | 0 | 0 | gpt-5-mini low | False |
| 12 | 999.845 | 0 | 0 | 0 | gpt-5-mini low | False |
| 13 | 1105.79 | 0 | 0 | 0 | gpt-5-mini low | False |
And here is the filtered report, showing only the last attempt for each model.
In this final report, you can see how many attempts it took (Sequence column) to reach successful caching with all 3 methods, how long that took, or whether caching still failed after 13 attempts.
| Sequence | Time elapsed (secs) | Cache 1 (Responses) | Cache 2 (Prev ID) | Cache 3 (C.Completions) | Model | success |
|---|---|---|---|---|---|---|
| 10 | 453.332 | 1024 | 1024 | 1152 | gpt-4.1-mini | True |
| 6 | 204.062 | 1024 | 1024 | 1024 | gpt-4.1-nano | True |
| 13 | 644.254 | 0 | 1024 | 0 | gpt-4o-mini | False |
| 13 | 1105.79 | 0 | 0 | 0 | gpt-5-mini low | False |
| 7 | 279.243 | 1152 | 1152 | 1152 | gpt-5-mini minimal | True |
| 13 | 993.887 | 0 | 0 | 0 | gpt-5-nano low | False |
| 3 | 17.0276 | 1152 | 1152 | 1152 | gpt-5-nano minimal | True |
| 7 | 444.106 | 1152 | 1152 | 1152 | o4-mini low | True |
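As a side note, if you’d rather see how quickly each model reached its first fully cached request (instead of just its last attempt), a small addition like this on top of the `dfs` dataframe from the script above would compute it (just a sketch; models that never succeeded are simply omitted):

```python
# First request per model where all three caches hit at once.
first_hit = (
    dfs[dfs["success"]]
    .groupby("Model", as_index=False)
    .first()[["Model", "Sequence", "Time elapsed (secs)"]]
)
print(tabulate(first_hit, headers="keys", tablefmt="grid", showindex=False))
```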
@OpenAI_Support
I hope this can serve as a reference for monitoring prompt caching deterioration, and it would be great if someone from OpenAI staff could look into it further.