Hi all, I recently moved my script from Chat Completions (with GPT-4o) to the Responses API (with GPT-4.1 and now GPT-5).
With Chat Completions, prompt caching worked perfectly, and my code also logged the cached input tokens on every call.
I used both system and user messages, the system message being a fixed ~14k-token block that was cached consistently.
When I moved to Responses, I put all the system instructions (the fixed string of about 14k tokens) into the instructions parameter, leaving the user/assistant conversation in input (I manage the conversation state manually).
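For context, the input I pass is just a list of role/content messages that I extend turn by turn, roughly like this (simplified sketch, contents are placeholders):

# Simplified sketch of the manually managed conversation state passed as "input".
# The actual message contents are placeholders here.
messages = [
    {"role": "user", "content": "first user message..."},
    {"role": "assistant", "content": "previous assistant reply..."},
    {"role": "user", "content": "latest user message..."},
]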
I’m using a fixed prompt_cache_key parameter for this script.
I always pass the same json_schema for the output.
The session_id (used for metadata logging) changes from chat to chat, but is fixed within a single chat.
Example payload for my Responses call:
payload = {
    "model": GPT_MODEL,
    "reasoning": {"effort": GPT_REASONING_EFFORT},
    "instructions": instructions,
    "input": messages,
    "metadata": {
        "session_id": session_id
    },
    "store": True,
    "prompt_cache_key": prompt_cache_key,
    "text": {
        "format": {
            "type": "json_schema",
            "name": schema_name,
            "strict": True,
            "schema": schema
        }
    }
}
With this setup I cannot get any input tokens into the prompt cache, even though I know the instructions parameter is always identical and, at about 14k tokens, long enough to trigger the cache.
I verified this both from cached_tokens = 0 in the response usage and from the Dashboard usage statistics, which show only input and output tokens, nothing cached.
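For completeness, this is roughly how I send the payload and read the counters (simplified sketch; client is the standard OpenAI Python client):

from openai import OpenAI

client = OpenAI()

# Send the payload shown above through the Responses API.
response = client.responses.create(**payload)

# With Chat Completions I read usage.prompt_tokens_details.cached_tokens;
# here I read the Responses equivalent, and it is always 0 for me.
usage = response.usage
print("input tokens: ", usage.input_tokens)
print("cached tokens:", usage.input_tokens_details.cached_tokens)
print("output tokens:", usage.output_tokens)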
Can someone give me some advice to correct this behavior?
Thank you!
C