Respones API - how does prompt caching work and its cost implications

I am using the new Responses API (for CUA with o4-mini as the reasoner). (Overall the results are decent with reasoner ‘attached’ - otherwise CUA gets lost).

I want to accurately calculate the cost of completing a CUA task.

I can get the usage tokens to estimate the ballpark cost. However, I don’t fully understand how prompt caching works with ‘Responses’.

I am saving all requests and have truncation=auto so all state is on the server.

Can I assume that all the previous context is cached and so the costs are lower for existing input tokens and new input tokens incur the full cost?

There’s a good amount of information in the “documentation” link on the side.

Some clarifications, though:

“Store” does not affect the cache or its persistence at all. It is simply making repeated calls in a short time, a window 5-60 minutes depending on server load, that will reuse any cache automatically created and temporarily stored.

The activation is when the first 1024 tokens or greater are the same and the API call has been routed to a server that was just previously used. Then the input that is in common can have a discount applied for reuse of server input state.

Jay, thanks for the clarification.

In order to reduce cost, I would like to reduce the context window size for o4-mini but I can’t find a way to limit the number of tokens or delete older messages that are saved on the server.

I this case, I don’t what to keep the state on the client as that would mean sending large images over the network repeatedly.

Is my understanding correct? If so is there any solution that can be applied here?

It was immediately recognizable on the day of release that the server state was unsuitable for practical development, precisely because of the lack of any cost management.

You have two “truncation” options:

  • run the AI model context up to the maximum until you get an error
  • run the AI model context up to the maximum, and then get cache-breaking message discarding.

Responses can only be reused. The content of the ID cannot be altered.

For self-management with images, you can upload the image file once to files storage, and then continue to reference it by file id so it does not need to be re-transmitted.