I’ve been testing prompt caching with the Responses API and noticed prompt_cache_key behaves inconsistently. With GPT-4o, I often get many cached tokens on repeated runs, but with GPT-5-mini I rarely see any, even with identical prompts and the same cache key.
Here’s a minimal example:
import hashlib
import time

from django.conf import settings
from django.core.management.base import BaseCommand
from openai import OpenAI
from openai.types.responses import (
    ResponseCompletedEvent,
    ResponseInputParam,
    ResponseTextDeltaEvent,
)


class Command(BaseCommand):
    def handle(self, *args, **options) -> None:
        client = OpenAI(api_key=settings.OPENAI_API_KEY)

        # Large shared prefix (~10k tokens), well above the documented
        # 1024-token minimum for prompt caching.
        large_block = "abcdefghijkl1 " * 2048
        input_list: ResponseInputParam = [
            {"role": "system", "content": large_block},
        ]
        # Derive a stable cache key from the shared prefix.
        prompt_cache_key = hashlib.sha256(large_block.encode("utf-8")).hexdigest()[:64]

        input_list.append({"role": "user", "content": "Hello"})
        cached_tokens_1 = self.call_openai(client, input_list, prompt_cache_key)
        print(f"First call cached tokens: {cached_tokens_1}")

        time.sleep(2)

        input_list.append({"role": "user", "content": "Hello again"})
        cached_tokens_2 = self.call_openai(client, input_list, prompt_cache_key)
        print(f"Second call cached tokens: {cached_tokens_2}")

    def call_openai(
        self,
        client: OpenAI,
        input_list: ResponseInputParam,
        prompt_cache_key: str,
    ) -> int:
        accumulated_assistant = ""
        cached_tokens = 0
        stream = client.responses.create(
            model="gpt-4o-2024-11-20",
            # model="gpt-5-mini",
            input=input_list,
            stream=True,
            prompt_cache_key=prompt_cache_key,
        )
        for event in stream:
            if isinstance(event, ResponseTextDeltaEvent):
                accumulated_assistant += event.delta
            elif isinstance(event, ResponseCompletedEvent):
                # Usage (including cached token counts) arrives on the final event.
                response = event.response
                usage = response.usage.model_dump() if response.usage else {}
                cached_tokens = usage.get("input_tokens_details", {}).get(
                    "cached_tokens", 0
                )
                break
        # Append the assistant reply so the shared prefix keeps growing across calls.
        if accumulated_assistant.strip():
            input_list.append({"role": "assistant", "content": accumulated_assistant})
        return cached_tokens
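In case streaming affects how usage is reported, a non-streaming variant of the same call would look roughly like this (just a sketch – I'm assuming the cached token counts come back the same way on the response's usage object; call_openai_blocking is a made-up name):

    def call_openai_blocking(
        self,
        client: OpenAI,
        input_list: ResponseInputParam,
        prompt_cache_key: str,
    ) -> int:
        # Sketch: same request without stream=True; usage is read directly
        # from the returned Response instead of the completed event.
        response = client.responses.create(
            model="gpt-4o-2024-11-20",
            # model="gpt-5-mini",
            input=input_list,
            prompt_cache_key=prompt_cache_key,
        )
        usage = response.usage.model_dump() if response.usage else {}
        if response.output_text.strip():
            input_list.append({"role": "assistant", "content": response.output_text})
        return usage.get("input_tokens_details", {}).get("cached_tokens", 0)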
Example (GPT-4o):
First call cached tokens: 0
Second call cached tokens: 10112
If I run it again, most of the time I get full cached tokens on both calls.
However, when running this with a GPT-5 model, the result is almost always the same: zero cached tokens on both calls (note that I changed large_block so a new prompt_cache_key is generated when switching models – not sure whether that matters):
Example (GPT-5-mini):
First call cached tokens: 0
Second call cached tokens: 0
Is this a known issue? Are GPT-5 models just more inconsistent, or do they have weaker support for prompt caching?
It's worth mentioning that caching isn't 100% consistent with GPT-4o either; sometimes I also get 0 cached tokens there. But on average, GPT-4o gives far more cache hits than the GPT-5 models.
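For what it's worth, this is roughly how I'd quantify the "on average" part – a sketch of a helper on the same Command class that reuses call_openai above and just repeats the same shared prefix, switching the hardcoded model line between runs (the method name is made up):

    def measure_cache_hits(self, client: OpenAI, runs: int = 5) -> list[int]:
        # Hypothetical helper: send the same large shared prefix `runs` times,
        # collect the cached token counts, and compare averages per model.
        large_block = "abcdefghijkl1 " * 2048
        prompt_cache_key = hashlib.sha256(large_block.encode("utf-8")).hexdigest()[:64]
        results: list[int] = []
        for i in range(runs):
            input_list: ResponseInputParam = [
                {"role": "system", "content": large_block},
                {"role": "user", "content": f"Hello {i}"},
            ]
            results.append(self.call_openai(client, input_list, prompt_cache_key))
            time.sleep(2)
        return results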
Thanks!