How is prompt_cache_key actually used in API calls?

For the life of me, I can’t seem to understand how the key is used. As per the API reference documentation, prompt_cache_key can be passed in the request body of /v1/responses. The following is the request body I fire from Postman:

{
  "model": "gpt-4.1-nano",
  "prompt_cache_key": "yolo",
  "input": [
    {
      "role": "developer",
      "content": "Extremely long prompt which easily goes over 1024 tokens"
    },
    {
      "role": "user",
      "content": "Really long content coming from user"
    }
  ]
}

No matter how many times I call the API with the exact same developer content (and user content), cached tokens are still 0 in the response, even though I have added prompt_cache_key in the request body as you can see above. If I move the key inside the input, under the developer role, I get an error saying the field is unknown.

I can’t understand what is wrong, and I can’t find any example that shows how to use it properly. My developer prompt is extremely long and it isn’t getting cached, so I really need to leverage caching. All help is greatly appreciated.

Well, it is indeed confusing. But in my experience, there are two main factors involved:

  • It is not deterministic: sending two prompts in a row does not guarantee the second one is cached. It can take anywhere from a few seconds to a few minutes to take effect.
  • Using a previous_response_id seems to increase the chances of a cache hit, but that only suits conversational use cases.
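
For reference, here is a minimal sketch of the kind of loop that produces output like the test below, assuming a recent openai Python SDK that accepts prompt_cache_key as a keyword (otherwise pass it via extra_body). The model, the placeholder developer text and the sleep timings are illustrative only:

import time
from openai import OpenAI

client = OpenAI()

# Placeholder developer prompt; it must comfortably exceed 1024 tokens for the prefix to be cacheable.
DEVELOPER_PROMPT = "Extremely long prompt which easily goes over 1024 tokens. " * 200

for i in range(5):
    resp = client.responses.create(
        model="gpt-4.1-nano",
        prompt_cache_key="yolo",  # top-level parameter, next to "model"
        input=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": "Reply with OK."},
        ],
    )
    usage = resp.usage
    print(f"- Cached: {usage.input_tokens_details.cached_tokens}/{usage.input_tokens} Usage: {usage}")
    time.sleep(5 if i < 2 else 60)  # short gaps first, then minute-long pauses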

Here are the outputs of a real test (no previous_response_id involved):

#1st request - not expected to cache
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
#2nd request - too fast, might not cache
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
#3rd request - after a few seconds
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
...1 minute pause before proceeding...
#4th request - after an extra minute
- Cached: 0/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
...1 minute pause before proceeding...
#5th request - after an extra minute
- Cached: 1024/1239 Usage: ResponseUsage(input_tokens=1239, input_tokens_details=InputTokensDetails(cached_tokens=1024), output_tokens=2, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=1241)
Elapsed: 128.48 seconds since the first response.

Written here two days ago:

Essentially: use it, or there is little hope of getting a discount. I saw plenty of undiscounted gpt-5-mini calls, even when running the same input.

It is a top-level API parameter, alongside “model”.
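
In the chat completions case it goes in the same place. A minimal sketch, assuming the openai Python SDK, a placeholder static prefix and a made-up nonce message (the cache key name is arbitrary):

from openai import OpenAI

client = OpenAI()

# Placeholder static prefix (over 1024 tokens) followed by a nonce that changes between runs.
STATIC_PREFIX = "A long, unchanging developer prompt well over 1024 tokens. " * 150
nonce = "Nonce text that varies between runs. How many characters was that message?"

completion = client.chat.completions.create(
    model="gpt-4.1-nano",
    prompt_cache_key="nonce-test",  # top-level, alongside "model"
    messages=[
        {"role": "developer", "content": STATIC_PREFIX},
        {"role": "user", "content": nonce},
    ],
)
u = completion.usage
print("input tokens:", u.prompt_tokens, "output tokens:", u.completion_tokens)
print("uncached:", u.prompt_tokens - u.prompt_tokens_details.cached_tokens,
      "non-reasoning:", u.completion_tokens - u.completion_tokens_details.reasoning_tokens)
print("cached:", u.prompt_tokens_details.cached_tokens,
      "reasoning:", u.completion_tokens_details.reasoning_tokens)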


TEST: gpt-4.1-nano, chat completions.

The nonce message was 434 characters long.

input tokens: 1440 output tokens: 9
uncached: 1440 non-reasoning: 9
cached: 0 reasoning: 0

============== RESTART:
The nonce message was 604 characters long.

input tokens: 1440 output tokens: 9
uncached: 160 non-reasoning: 9
cached: 1280 reasoning: 0

The persistence was extremely short when I just tested: not enough time to be typing out a serious prompt. (The AI also can’t count the message with the 1400 extra characters.)

+1, it’s very inconsistent even with a static prompt and cache key. I have found that for best results you need to “prime” the cache first. For example, try sending 50 prompts, one every 2 seconds; about halfway through, you start getting frequent hits, at about 80%. But the frustrating thing is that it’s been impossible to figure out the optimal priming formula. E.g. 15 prompts every 10 seconds sometimes seems to do it, other times not.
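
A rough sketch of such a priming loop, assuming the openai Python SDK; the model, the cache key and the prompt are placeholders:

import time
from openai import OpenAI

client = OpenAI()

# Placeholder static prompt, comfortably over 1024 tokens.
PREFIX = "Static developer prompt that easily exceeds 1024 tokens. " * 150

N = 50
hits = 0
for i in range(N):
    resp = client.responses.create(
        model="gpt-4.1-nano",
        prompt_cache_key="prime-test",
        input=[
            {"role": "developer", "content": PREFIX},
            {"role": "user", "content": "Reply with OK."},
        ],
    )
    cached = resp.usage.input_tokens_details.cached_tokens
    hits += cached > 0
    print(f"request {i + 1:02d}: cached_tokens={cached}")
    time.sleep(2)  # one prompt every 2 seconds

print(f"hit rate: {hits / N:.0%}")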

OpenAI, please give us more details on how the cache works; the docs are too vague. See Prompt caching with tools - API - OpenAI Developer Community for more questions.


One additional finding: prompt_cache_key works very well with gpt-5 and gpt-5-mini (but not gpt-5-nano, and not any of the other models). I ran similar 50-request batches with all models. For the older models, prompt_cache_key makes no difference: with or without it, after priming the cache, you get about an 80% hit rate. For gpt-5 and gpt-5-mini, the behavior without prompt_cache_key is very similar, but with prompt_cache_key:

  1. the cache is primed very quickly; often I get a cache hit already on the second query
  2. once primed, the cache hit rate is about 95%!

It definitely seems prompt_cache_key is meant to be used with these two models, and it helps optimize cache use.
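
For context, the comparison can be run with batches along these lines. This is a sketch only, assuming the openai Python SDK; the hit_rate helper, the model list and the prompt are placeholders, not the exact script used:

import time
from openai import OpenAI

client = OpenAI()

# Placeholder shared prefix, well over 1024 tokens.
PREFIX = "Shared developer prompt that easily exceeds 1024 tokens. " * 150


def hit_rate(model, cache_key=None, n=50):
    """Send n near-identical requests and return the fraction reporting cached_tokens > 0."""
    extra = {"prompt_cache_key": cache_key} if cache_key else {}
    hits = 0
    for _ in range(n):
        resp = client.responses.create(
            model=model,
            input=[
                {"role": "developer", "content": PREFIX},
                {"role": "user", "content": "Reply with OK."},
            ],
            **extra,
        )
        hits += resp.usage.input_tokens_details.cached_tokens > 0
        time.sleep(2)
    return hits / n


for model in ("gpt-5-mini", "gpt-4.1-nano"):
    print(model, "with key:", hit_rate(model, "batch-test"), "without key:", hit_rate(model))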