I'm trying to cache a long system prompt that contains a few few-shot examples and the Manim library reference, for an animation generator I'm building. The system instructions are static.
There are a few tips in the docs, like setting the user parameter, since caching takes it into account when deciding which machine actually stores the cache (sometimes the request spills over to another machine and the cache is lost).
Could you perhaps share an outline of how you are structuring your prompt? There might be some detail we are missing. It seems everything is fixed except for animationRequest; is that so?
The important thing is that the start of the prompt must be an exact match for caching to take effect, so any changes should be made toward the end of your input.
It may sound like a peculiar suggestion, but have you tried putting a system role message inside the input parameter instead of using instructions, so that the user message (the part that changes) comes after the instructions?
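Roughly something like this, if you're on the Python SDK. It's just a sketch: the model name, the prompt variables, and the request text below are placeholders I made up, not anything from your code.

from openai import OpenAI

client = OpenAI()

# Placeholders for illustration -- in the real app the static block would be the
# long few-shot examples plus the Manim reference, and animation_request would
# be the only text that changes per call.
STATIC_SYSTEM_PROMPT = "You are a Manim animation generator. <few-shot examples / lib reference here>"
animation_request = "Animate a circle morphing into a square."

response = client.responses.create(
    model="gpt-4o",  # placeholder model name
    input=[
        # Static block first, as a system-role message inside `input`
        # rather than via the `instructions` parameter.
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # The variable part goes last, so the shared prefix stays identical.
        {"role": "user", "content": animation_request},
    ],
)
print(response.output_text)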
You can also try hashing the request object that gets created and seeing whether the value changes between calls.
A quick glance takes me here:
    body=maybe_transform(
        {
            "input": input,
            "model": model,
            "include": include,
            "instructions": instructions,
            "max_output_tokens": max_output_tokens,
            "metadata": metadata,
            "parallel_tool_calls": parallel_tool_calls,
            "previous_response_id": previous_response_id,
            "reasoning": reasoning,
            "store": store,
            "stream": stream,
            "temperature": temperature,
            "text": text,
            "tool_choice": tool_choice,
            "tools": tools,
            "top_p": top_p,
            "truncation": truncation,
            "user": user,
        },
        response_create_params.ResponseCreateParams,
    ),
    options=make_request_options(
        extra_headers=extra_headers,
        extra_query=extra_query,
        extra_body=extra_body,
        timeout=timeout,
        post_parser=parser,
    ),
    # we turn the `Response` instance into a `ParsedResponse`
    # in the `parser` function above
    cast_to=cast(Type[ParsedResponse[TextFormatT]], Response),
)
That would be my first step: eliminate the chance of it being the request object. You’ve said it’s static and the tokens are the same, but you never know.
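For the hashing suggestion above, a minimal sketch would be to fingerprint the parts of the payload that are supposed to be static and log the digest on every call; the dict contents below are placeholders, not your actual prompt:

import hashlib
import json

def fingerprint(value) -> str:
    """Deterministic SHA-256 of any JSON-serializable value."""
    return hashlib.sha256(
        json.dumps(value, sort_keys=True, ensure_ascii=False).encode("utf-8")
    ).hexdigest()

# Hash only the parts that are supposed to be static (here, the instructions
# string and the tools list) and log it on every request. If the digest ever
# changes between calls, the "static" prefix isn't actually static.
static_part = {
    "instructions": "long Manim system prompt with few-shot examples...",  # placeholder
    "tools": [],
}
print(fingerprint(static_part))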
When you say the “sys” instructions are static, do you mean what is passed in “instructions” rather than in “input”?
I don’t think that “instructions” hits the cache. I think only input does.
Did you try sending it all within input instead of within “instructions” to confirm if this is the case?
Obviously it’s not what you want, but if this pathway is successful then you know that instructions likely cannot hit the cache whereas input can, and you might have to modify your calls accordingly…
Though it would be kind of odd; it’s hard to say how the Responses API treats everything. With a simpler endpoint like Completions it’s much clearer how you hit the cache, but the Responses API does a lot of hidden transformation on the backend: at the end of the day it converts everything you provide into one giant block of formatted text that becomes the context window. The model never receives the pieces separately; it receives the end result of the Responses middleware transformations as a single large blob of text, and I think that is what has to hit the cache (i.e. the final output of those transformations, immediately before it’s actually handed to the model).
The instructions parameter on the Responses endpoint will definitely break the cache if it is altered. It is placed as the first system message, and anything you send in the input field, or pull in by reusing a response id, comes after it.
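To illustrate the point, here is a conceptual sketch only; it reflects my assumption above about how the context is assembled, not actual backend code:

# Assumption: `instructions` becomes the first system message, followed by
# everything in `input`, and caching matches on the leading tokens of this
# flattened context.
def effective_context(instructions: str, input_messages: list[dict]) -> list[dict]:
    return [{"role": "system", "content": instructions}] + input_messages

a = effective_context("Manim prompt v1", [{"role": "user", "content": "request 1"}])
b = effective_context("Manim prompt v2", [{"role": "user", "content": "request 2"}])

# A change to `instructions` alters message 0, so the shared prefix collapses
# to nothing -- no cache hit, even if `input` barely changed.
shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(f"messages shared from the start: {shared}")  # 0 here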
Altering the “user” parameter also breaks the cache. You can use the “user” parameter to route a given customer ID to a server instance holding their prior usage, which helps if your organization is sending many similar API calls.
The problem may also be that, if you are sending too large a context with truncation: auto, messages are being auto-dropped by different amounts with every new addition, which changes the prefix.
And finally: the cache typically lasts around 5-10 minutes, and is simply not guaranteed.
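A minimal sketch of pinning both of those knobs, assuming the Python SDK; the model name, prompt, request, and user value are placeholders, and double-check that “disabled” is the truncation value you want on your SDK version:

from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = "long static Manim prompt..."  # placeholder

response = client.responses.create(
    model="gpt-4o",                     # placeholder model
    instructions=STATIC_SYSTEM_PROMPT,  # the unchanged long prompt
    input="Animate a pendulum.",        # placeholder request
    user="animation-generator-01",      # keep this constant across calls so
                                        # requests are routed consistently
    truncation="disabled",              # avoid auto-dropping context, which
                                        # would change the prefix between calls
)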
I tried getting rid of the Responses API, tried adding the system instructions as a role message, and also tried the user field that was mentioned, still no luck.
Edit (solution note): This worked, but I was also hitting the API at what was possibly a busy time, and I put this approach under more stress to actually test things out, so maybe the first approach could have been perfectly fine too. Thanks!
I think oftentimes the simplest answer is the right answer.
If you’re querying at a busy time, then the cache will quickly become saturated by other users who call the models more frequently. Cache discounts aren’t guaranteed so even if you’re doing everything right, you might just be getting kicked out of the cache by these other customers.
Have you had any success with fewer tokens? How many subsequent tries are you sending, and did you wait a bit to let the cache take effect? I’ve noticed that it can sometimes take a little while and a few requests.
If that still doesn’t work, I have another unusual suggestion you could try that involves Responses API forks. Let me know if you want more details.
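One way to check is to fire the same shape of request a few times and read the cached-token counter from usage. A sketch, assuming the field is usage.input_tokens_details.cached_tokens on the Responses API (the field names are my assumption, so verify them against your SDK version; model and prompt are placeholders):

import time
from openai import OpenAI

client = OpenAI()

STATIC_PROMPT = "long static Manim prompt..."  # placeholder

# Send the same shape of request a few times and watch whether cached tokens
# start showing up after the first call.
for i in range(3):
    resp = client.responses.create(
        model="gpt-4o",  # placeholder
        instructions=STATIC_PROMPT,
        input=f"Animate scene number {i}",
    )
    details = getattr(resp.usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"call {i}: input_tokens={resp.usage.input_tokens}, cached_tokens={cached}")
    time.sleep(2)  # small gap between calls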
This seems to work. Cache consistency is still not very deterministic, but that’s just expected from a shared-infrastructure solution, and I am getting some cache hits now; as others mentioned, the misses could also be due to high demand.