How is prompt_cache_key actually used in API calls?

For the life of me, I can’t seem to understand how the key is used. As per the API reference documentation, prompt_cache_key can be passed in the request body of /v1/responses. The following is the request body I fire from Postman:

{
  "model": "gpt-4.1-nano",
  "prompt_cache_key": "yolo",
  "input": [
    {
      "role": "developer",
      "content": "Extremely long prompt which easily goes over 1024 tokens"
    },
    {
      "role": "user",
      "content": "Really long content coming from user"
    }
  ]
}

No matter how many times I call the API with the exact same developer content (and user content), cached tokens are still 0 in the response, even though I have added prompt_cache_key in the request body as you can see above. If I move the key inside the input, under the developer role, I get an error saying the field is unknown.

I can’t understand what is wrong, and I can’t find any example that shows how to use it properly. My developer prompt is extremely long and it isn’t getting cached, so I really need to leverage caching. All help is greatly appreciated.

Well, it is indeed confusing. But in my experience, there are two main factors involved:

  • It is not deterministic: sending two prompts in a row does not guarantee the second one is cached. It can take anywhere from a few seconds to a few minutes to take effect.
  • Using a previous_response_id seems to increase the chances of a cache hit, but that only suits conversational use cases.
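
For reference, here is a minimal sketch of the kind of loop that produces output like the test below, assuming a recent openai Python SDK that accepts prompt_cache_key as a keyword (otherwise pass it via extra_body). The model, the placeholder developer text and the sleep timings are illustrative only:

import time
from openai import OpenAI

client = OpenAI()

# Placeholder developer prompt; it must comfortably exceed 1024 tokens for the prefix to be cacheable.
DEVELOPER_PROMPT = "Extremely long prompt which easily goes over 1024 tokens. " * 200

for i in range(5):
    resp = client.responses.create(
        model="gpt-4.1-nano",
        prompt_cache_key="yolo",  # top-level parameter, next to "model"
        input=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": "Reply with OK."},
        ],
    )
    usage = resp.usage
    print(f"- Cached: {usage.input_tokens_details.cached_tokens}/{usage.input_tokens} Usage: {usage}")
    time.sleep(5 if i < 2 else 60)  # short gaps first, then minute-long pauses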

Here are the outputs of a real test (no previous_response_id involved):

#1st request - not expected to cache
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
#2nd request - too fast, might not cache
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
#3rd request - after a few seconds
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
...1 minute pause before proceeding...
#4th request - after an extra minute
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
...1 minute pause before proceeding...
#5th request - after an extra minute
- Cached: 1024/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=1024), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
Elapsed: 128.48 seconds since the first response.

Written here two days ago:

Essentially: use it, or there is little hope of getting a discount. I saw plenty of undiscounted gpt-5-mini calls, even when running the same input.

It is a top-level API parameter, alongside “model”.
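
In the chat completions case it goes in the same place. A minimal sketch, assuming the openai Python SDK, a placeholder static prefix and a made-up nonce message (the cache key name is arbitrary):

from openai import OpenAI

client = OpenAI()

# Placeholder static prefix (over 1024 tokens) followed by a nonce that changes between runs.
STATIC_PREFIX = "A long, unchanging developer prompt well over 1024 tokens. " * 150
nonce = "Nonce text that varies between runs. How many characters was that message?"

completion = client.chat.completions.create(
    model="gpt-4.1-nano",
    prompt_cache_key="nonce-test",  # top-level, alongside "model"
    messages=[
        {"role": "developer", "content": STATIC_PREFIX},
        {"role": "user", "content": nonce},
    ],
)
u = completion.usage
print("input tokens:", u.prompt_tokens, "output tokens:", u.completion_tokens)
print("uncached:", u.prompt_tokens - u.prompt_tokens_details.cached_tokens,
      "non-reasoning:", u.completion_tokens - u.completion_tokens_details.reasoning_tokens)
print("cached:", u.prompt_tokens_details.cached_tokens,
      "reasoning:", u.completion_tokens_details.reasoning_tokens)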


TEST: gpt-4.1-nano, chat completions.

The nonce message was 434 characters long.

input tokens: 1440 output tokens: 9
uncached: 1440 non-reasoning: 9
cached: 0 reasoning: 0

============== RESTART:
The nonce message was 604 characters long.

input tokens: 1440 output tokens: 9
uncached: 160 non-reasoning: 9
cached: 1280 reasoning: 0

The persistence was extremely short when I just tested: not enough time to be typing out a serious prompt. (The AI also can’t count the message with the 1400 extra characters.)

+1, it’s very inconsistent even with a static prompt and cache key. I have found that for best results you need to “prime” the cache first. For example, try sending 50 prompts, one every 2 seconds; about halfway through, you start getting frequent hits, at about 80%. But the frustrating thing is that it’s been impossible to figure out the optimal priming formula. E.g. 15 prompts every 10 seconds sometimes seems to do it, other times not.
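
A rough sketch of such a priming loop, assuming the openai Python SDK; the model, the cache key and the prompt are placeholders:

import time
from openai import OpenAI

client = OpenAI()

# Placeholder static prompt, comfortably over 1024 tokens.
PREFIX = "Static developer prompt that easily exceeds 1024 tokens. " * 150

N = 50
hits = 0
for i in range(N):
    resp = client.responses.create(
        model="gpt-4.1-nano",
        prompt_cache_key="prime-test",
        input=[
            {"role": "developer", "content": PREFIX},
            {"role": "user", "content": "Reply with OK."},
        ],
    )
    cached = resp.usage.input_tokens_details.cached_tokens
    hits += cached > 0
    print(f"request {i + 1:02d}: cached_tokens={cached}")
    time.sleep(2)  # one prompt every 2 seconds

print(f"hit rate: {hits / N:.0%}")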

OpenAI, please give us more details on how the cache works; the docs are too vague. See Prompt caching with tools - API - OpenAI Developer Community for more questions.


One additional finding: prompt_cache_key works very well with gpt-5 and gpt-5-mini (but not gpt-5-nano, and not any of the other models). I ran similar 50-request batches with all models. For the older models, prompt_cache_key makes no difference: with or without it, after priming the cache, you get about an 80% hit rate. For gpt-5 and gpt-5-mini, the behavior without prompt_cache_key is very similar, but with prompt_cache_key:

  1. the cache is primed very quickly; often I get a cache hit already on the second query
  2. once primed, the cache hit rate is about 95%!

It definitely seems prompt_cache_key is meant to be used with these two models, and it helps optimize cache use.
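
For context, the comparison can be run with batches along these lines. This is a sketch only, assuming the openai Python SDK; the hit_rate helper, the model list and the prompt are placeholders, not the exact script used:

import time
from openai import OpenAI

client = OpenAI()

# Placeholder shared prefix, well over 1024 tokens.
PREFIX = "Shared developer prompt that easily exceeds 1024 tokens. " * 150


def hit_rate(model, cache_key=None, n=50):
    """Send n near-identical requests and return the fraction reporting cached_tokens > 0."""
    extra = {"prompt_cache_key": cache_key} if cache_key else {}
    hits = 0
    for _ in range(n):
        resp = client.responses.create(
            model=model,
            input=[
                {"role": "developer", "content": PREFIX},
                {"role": "user", "content": "Reply with OK."},
            ],
            **extra,
        )
        hits += resp.usage.input_tokens_details.cached_tokens > 0
        time.sleep(2)
    return hits / n


for model in ("gpt-5-mini", "gpt-4.1-nano"):
    print(model, "with key:", hit_rate(model, "batch-test"), "without key:", hit_rate(model))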