For the life of me, I can’t seem to understand how the key is used. As per the API reference documentation, prompt_cache_key can be passed in the request body of /v1/responses. The following is the request body I fire from Postman:
{
  "model": "gpt-4.1-nano",
  "prompt_cache_key": "yolo",
  "input": [
    {
      "role": "developer",
      "content": "Extremely long prompt which easily goes over 1024 tokens"
    },
    {
      "role": "user",
      "content": "Really long content coming from user"
    }
  ]
}
No matter how many times I call the API with the exact same developer content (and user content), cached tokens are still 0 in the response, even though I have added prompt_cache_key in the request body as you can see above. If I move the key inside the input, under the developer role, I get an error that the field is unknown.
I can’t understand what is wrong, and I can’t find any example that shows how to use it properly. My developer prompt is extremely long and it isn’t getting cached, so I really need to leverage caching. All help is greatly appreciated.
Essentially: use it, or there is little hope of getting a discount. I see plenty of undiscounted gpt-5-mini calls even when running the same input.
It is a top-level API parameter, alongside "model".
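For example, here is the same request made with the openai Python SDK, with prompt_cache_key sitting next to model rather than inside any input item. This is just a sketch and assumes a recent SDK version that exposes the parameter; the cached-token fields are read from the Responses usage object.

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4.1-nano",
    prompt_cache_key="yolo",  # top-level, alongside model
    input=[
        {"role": "developer", "content": "Extremely long prompt which easily goes over 1024 tokens"},
        {"role": "user", "content": "Really long content coming from user"},
    ],
)

# The Responses API reports caching in the usage object:
print("input tokens:", resp.usage.input_tokens)
print("cached tokens:", resp.usage.input_tokens_details.cached_tokens)
```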
TEST: gpt-4.1-nano, chat completions.
The nonce message was 434 characters long.
input tokens: 1440
output tokens: 9
uncached: 1440
non-reasoning: 9
cached: 0
reasoning: 0
============== RESTART:
The nonce message was 604 characters long.
input tokens: 1440
output tokens: 9
uncached: 160
non-reasoning: 9
cached: 1280
reasoning: 0
The cache persistence was extremely short when I just tested it, not enough time to be typing a serious prompt by hand. (The AI also can’t count the message with 1400 extra characters.)
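In case it’s useful, here’s a minimal sketch of that kind of repeat-call test against Chat Completions. The model and cache key follow this thread; the long prefix and the random nonce helper are just placeholders, and it assumes an SDK version that accepts prompt_cache_key.

```python
import random
import string

from openai import OpenAI

client = OpenAI()

# Identical long prefix on every call (must be well over 1024 tokens to be cacheable).
LONG_DEVELOPER_PROMPT = "Extremely long instructions that easily go over 1024 tokens. " * 200

def run_once(nonce_len: int) -> None:
    # Random nonce in the user turn, so only the developer prefix can be reused from cache.
    nonce = "".join(random.choices(string.ascii_lowercase + " ", k=nonce_len))
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        prompt_cache_key="yolo",  # top-level parameter
        messages=[
            {"role": "developer", "content": LONG_DEVELOPER_PROMPT},
            {"role": "user", "content": f"How many characters long is this message? {nonce}"},
        ],
    )
    u = resp.usage
    print("input tokens:", u.prompt_tokens)
    print("output tokens:", u.completion_tokens)
    print("cached:", u.prompt_tokens_details.cached_tokens)
    print("reasoning:", u.completion_tokens_details.reasoning_tokens)

run_once(434)  # first call: expect cached == 0
run_once(604)  # second call shortly after: most of the shared prefix should be cached
```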
+1, it’s very inconsistent even with a static prompt and cache key. I have found that for best results you need to “prime” the cache first: for example, try sending 50 prompts, one every 2 seconds. About halfway through that you start getting frequent hits, at about 80%. But the frustrating thing is that it’s been impossible to figure out the optimal priming formula. E.g. 15 prompts every 10 seconds sometimes seems to do it, other times not.
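A rough sketch of that priming loop, in case anyone wants to reproduce it. The 2-second interval and 50 requests come from the post above; the model, the prompt placeholder, and the CACHE_KEY value are just examples.

```python
import time

from openai import OpenAI

client = OpenAI()

STATIC_PROMPT = "Your long static prompt (well over 1024 tokens) goes here ..."  # placeholder
CACHE_KEY = "my-static-prompt-v1"  # example key; any stable string should do

N = 50
hits = 0
for i in range(N):
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        prompt_cache_key=CACHE_KEY,
        messages=[{"role": "user", "content": STATIC_PROMPT}],
    )
    cached = resp.usage.prompt_tokens_details.cached_tokens
    hits += 1 if cached > 0 else 0
    print(f"request {i + 1}: cached_tokens={cached}")
    time.sleep(2)  # roughly one request every 2 seconds

print(f"hit rate: {hits / N:.0%}")
```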
One additional finding: prompt_cache_key works very well with gpt-5 and gpt-5-mini (but not gpt-5-nano, and not any of the other models). I ran similar 50-request batches with all models. For the older models, prompt_cache_key makes no difference and, after priming the cache, you get about an 80% hit rate. For gpt-5 and gpt-5-mini, the behavior without prompt_cache_key is very similar, but with prompt_cache_key:

- the cache is primed very quickly; often I get a cache hit already on the second query
- once primed, the cache hit rate is about 95%!

It definitely seems prompt_cache_key is meant to be used with these two models and helps optimize cache use.