I'm trying to cache a long system prompt that contains a few few-shot examples and the Manim library reference, for an animation generator I'm building. The system instructions are static.
There are a few tips in the docs, like setting the user parameter, since caching takes it into account when deciding which machine actually stores the cache (sometimes the request spills over to another machine and the cache is lost).
Could you perhaps share an outline of how you are structuring your prompt? There might be some detail we are missing. It seems everything is fixed except for animationRequest; is that so?
The important thing is that the start of the prompt must be an exact match for caching to take effect, so any changes should be made toward the end of your input.
It may sound like a peculiar suggestion, but have you tried putting a system role message inside the input parameter instead of using instructions, so that the user message (the part that changes) comes after the instructions?
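Roughly something like this, if you're on the Python SDK. It's just a sketch: the model name, the prompt variables, and the request text below are placeholders I made up, not anything from your code.

from openai import OpenAI

client = OpenAI()

# Placeholders for illustration -- in the real app the static block would be the
# long few-shot examples plus the Manim reference, and animation_request would
# be the only text that changes per call.
STATIC_SYSTEM_PROMPT = "You are a Manim animation generator. <few-shot examples / lib reference here>"
animation_request = "Animate a circle morphing into a square."

response = client.responses.create(
    model="gpt-4o",  # placeholder model name
    input=[
        # Static block first, as a system-role message inside `input`
        # rather than via the `instructions` parameter.
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # The variable part goes last, so the shared prefix stays identical.
        {"role": "user", "content": animation_request},
    ],
)
print(response.output_text)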
You can also try hashing the request object that gets created and seeing whether the value changes between calls.
A quick glance takes me here:
    body=maybe_transform(
        {
            "input": input,
            "model": model,
            "include": include,
            "instructions": instructions,
            "max_output_tokens": max_output_tokens,
            "metadata": metadata,
            "parallel_tool_calls": parallel_tool_calls,
            "previous_response_id": previous_response_id,
            "reasoning": reasoning,
            "store": store,
            "stream": stream,
            "temperature": temperature,
            "text": text,
            "tool_choice": tool_choice,
            "tools": tools,
            "top_p": top_p,
            "truncation": truncation,
            "user": user,
        },
        response_create_params.ResponseCreateParams,
    ),
    options=make_request_options(
        extra_headers=extra_headers,
        extra_query=extra_query,
        extra_body=extra_body,
        timeout=timeout,
        post_parser=parser,
    ),
    # we turn the `Response` instance into a `ParsedResponse`
    # in the `parser` function above
    cast_to=cast(Type[ParsedResponse[TextFormatT]], Response),
)
That would be my first step: eliminate the chance of it being the request object. You’ve said it’s static and the tokens are the same, but you never know.
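For the hashing suggestion above, a minimal sketch would be to fingerprint the parts of the payload that are supposed to be static and log the digest on every call; the dict contents below are placeholders, not your actual prompt:

import hashlib
import json

def fingerprint(value) -> str:
    """Deterministic SHA-256 of any JSON-serializable value."""
    return hashlib.sha256(
        json.dumps(value, sort_keys=True, ensure_ascii=False).encode("utf-8")
    ).hexdigest()

# Hash only the parts that are supposed to be static (here, the instructions
# string and the tools list) and log it on every request. If the digest ever
# changes between calls, the "static" prefix isn't actually static.
static_part = {
    "instructions": "long Manim system prompt with few-shot examples...",  # placeholder
    "tools": [],
}
print(fingerprint(static_part))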
When you say the “sys” instructions are static, do you mean what is passed in “instructions” rather than in “input”?
I don’t think that “instructions” hits the cache. I think only input does.
Did you try sending it all within input instead of within “instructions” to confirm if this is the case?
Obviously it’s not what you want, but if this pathway is successful then you know that instructions likely cannot hit the cache whereas input can, and you might have to modify your calls accordingly…
Though it would be kind of odd; it’s hard to say how the Responses API treats everything. With a simpler endpoint like Completions it’s much clearer how you hit the cache, but the Responses API does a lot of hidden transformation on the backend: at the end of the day it converts everything you provide into one giant block of formatted text that becomes the context window. The model never receives the pieces separately; it receives the end result of the Responses middleware transformations as a single large blob of text, and I think that is what has to hit the cache (i.e. the final output of those transformations, immediately before it’s actually handed to the model).
The instructions parameter on the Responses endpoint will definitely break the cache if it is altered. It is placed as the first system message, and anything you send in the input field, or pull in by reusing a response id, comes after it.
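To illustrate the point, here is a conceptual sketch only; it reflects my assumption above about how the context is assembled, not actual backend code:

# Assumption: `instructions` becomes the first system message, followed by
# everything in `input`, and caching matches on the leading tokens of this
# flattened context.
def effective_context(instructions: str, input_messages: list[dict]) -> list[dict]:
    return [{"role": "system", "content": instructions}] + input_messages

a = effective_context("Manim prompt v1", [{"role": "user", "content": "request 1"}])
b = effective_context("Manim prompt v2", [{"role": "user", "content": "request 2"}])

# A change to `instructions` alters message 0, so the shared prefix collapses
# to nothing -- no cache hit, even if `input` barely changed.
shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(f"messages shared from the start: {shared}")  # 0 here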
Altering the “user” parameter also breaks the cache. You can use the “user” parameter to route a given customer ID to a server instance holding their prior usage, which helps if your organization is sending many similar API calls.
The problem may also be that, if you are sending too large a context with truncation: auto, messages are being auto-dropped by different amounts with every new addition, which changes the prefix.
And finally: the cache typically lasts around 5-10 minutes, and is simply not guaranteed.
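A minimal sketch of pinning both of those knobs, assuming the Python SDK; the model name, prompt, request, and user value are placeholders, and double-check that “disabled” is the truncation value you want on your SDK version:

from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = "long static Manim prompt..."  # placeholder

response = client.responses.create(
    model="gpt-4o",                     # placeholder model
    instructions=STATIC_SYSTEM_PROMPT,  # the unchanged long prompt
    input="Animate a pendulum.",        # placeholder request
    user="animation-generator-01",      # keep this constant across calls so
                                        # requests are routed consistently
    truncation="disabled",              # avoid auto-dropping context, which
                                        # would change the prefix between calls
)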
I tried getting rid of the Responses API, tried adding the system instructions as a role message, and also tried the user field that was mentioned, still no luck.
Edit (solution note): This worked, but I was also hitting the API at what was possibly a busy time, and I put this approach under more stress to actually test things out, so maybe the first approach could have been perfectly fine too. Thanks!
I think oftentimes the simplest answer is the right answer.
If you’re querying at a busy time, then the cache will quickly become saturated by other users who call the models more frequently. Cache discounts aren’t guaranteed so even if you’re doing everything right, you might just be getting kicked out of the cache by these other customers.
Have you had any success with fewer tokens? How many subsequent tries are you sending, and did you wait a bit to let the cache take effect? I’ve noticed that it can sometimes take a little while and a few requests.
If that still doesn’t work, I have another unusual suggestion you could try that involves Responses API forks. Let me know if you want more details.
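One way to check is to fire the same shape of request a few times and read the cached-token counter from usage. A sketch, assuming the field is usage.input_tokens_details.cached_tokens on the Responses API (the field names are my assumption, so verify them against your SDK version; model and prompt are placeholders):

import time
from openai import OpenAI

client = OpenAI()

STATIC_PROMPT = "long static Manim prompt..."  # placeholder

# Send the same shape of request a few times and watch whether cached tokens
# start showing up after the first call.
for i in range(3):
    resp = client.responses.create(
        model="gpt-4o",  # placeholder
        instructions=STATIC_PROMPT,
        input=f"Animate scene number {i}",
    )
    details = getattr(resp.usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"call {i}: input_tokens={resp.usage.input_tokens}, cached_tokens={cached}")
    time.sleep(2)  # small gap between calls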
This seems to work. Cache consistency is still not very deterministic, but that’s just expected from a shared-infrastructure solution, and I am getting some cache hits now; as others mentioned, the misses could also be due to high demand.