Respones API - how does prompt caching work and its cost implications

fwaris · July 3, 2025, 7:42pm

I am using the new Responses API (for CUA with o4-mini as the reasoner). (Overall the results are decent with reasoner ‘attached’ - otherwise CUA gets lost).

I want to accurately calculate the cost of completing a CUA task.

I can get the usage tokens to estimate the ballpark cost. However, I don’t fully understand how prompt caching works with ‘Responses’.

I am saving all requests and have truncation=auto so all state is on the server.

Can I assume that all the previous context is cached and so the costs are lower for existing input tokens and new input tokens incur the full cost?

_j · July 3, 2025, 11:33pm

There’s a good amount of information in the “documentation” link on the side.

Some clarifications, though:

“Store” does not affect the cache or its persistence at all. It is simply making repeated calls in a short time, a window 5-60 minutes depending on server load, that will reuse any cache automatically created and temporarily stored.

The activation is when the first 1024 tokens or greater are the same and the API call has been routed to a server that was just previously used. Then the input that is in common can have a discount applied for reuse of server input state.

fwaris · July 4, 2025, 11:51am

Jay, thanks for the clarification.

In order to reduce cost, I would like to reduce the context window size for o4-mini but I can’t find a way to limit the number of tokens or delete older messages that are saved on the server.

I this case, I don’t what to keep the state on the client as that would mean sending large images over the network repeatedly.

Is my understanding correct? If so is there any solution that can be applied here?

_j · July 4, 2025, 12:55pm

It was immediately recognizable on the day of release that the server state was unsuitable for practical development, precisely because of the lack of any cost management.

You have two “truncation” options:

run the AI model context up to the maximum until you get an error
run the AI model context up to the maximum, and then get cache-breaking message discarding.

Responses can only be reused. The content of the ID cannot be altered.

For self-management with images, you can upload the image file once to files storage, and then continue to reference it by file id so it does not need to be re-transmitted.

Topic		Replies	Views
Responses API high token consumption API responses , responses-api	7	254	June 23, 2025
Reponses API vs Prompt caching API api	0	209	April 17, 2025
How to use cached_tokens field to calculate cost estimation API pricing , api-costs , prompt-caching , cache	1	149	July 31, 2025
How to save input tokens in Responses API? API responses	5	444	May 23, 2025
Prompt Caching for o3-mini? API o3-mini	3	568	February 4, 2025

Respones API - how does prompt caching work and its cost implications

Related topics