Why no prompt caching for batch jobs?

I don’t understand why flex processing is able to give a discount on cached input tokens, but requests sent as “proper” batches are not eligible for prompt caching.

(With Azure OpenAI, I can actually see prompt caching is “working”; they just aren’t discounting the input tokens :frowning:)
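Here’s roughly how that shows up, as a minimal sketch assuming the standard openai Python SDK and the published usage fields (the model name is just a placeholder; on Azure the same fields should appear via the AzureOpenAI client):

```python
# Minimal check of whether a request reported any cached input tokens.
# Assumes the standard openai Python SDK and a prompt long enough to be
# cacheable; the field path follows the published usage schema.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "A shared prefix well over 1,024 tokens..."}],
)

usage = response.usage
details = usage.prompt_tokens_details
print("prompt tokens:", usage.prompt_tokens)
print("cached prompt tokens:", details.cached_tokens if details else 0)
```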

Lack of a QA team. And also likely because batches make it “too easy” to get cache hits consistently.


There’s no reason to think that OpenAI wouldn’t minimize the computation via technology if the technology facilitated that. That is exactly what context window caching does: it reduces recomputation.

We can look at what has been disclosed about the technology so far to infer why:

Inbound API calls have the start of the input and the user field hashed, so that similar calls are routed to the same server holding a local cache state. That mechanism can handle roughly 15 calls per minute before additional calls are rotated out to another instance for servicing.
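As a rough client-side sketch of playing along with that routing (assuming the standard openai Python SDK, and inferring from the description above, not from any guarantee, that a byte-identical prefix plus a stable user value are what the hash keys on):

```python
# Sketch: keep the cacheable prefix byte-identical and the `user` value
# stable so the routing described above has something consistent to hash.
# The model name and workload identifier are placeholders.
from openai import OpenAI

client = OpenAI()

# A long, byte-for-byte identical prefix (system prompt plus shared context)
# gives the prefix hash something stable to match on.
SHARED_PREFIX = "You are a contract-review assistant. Follow these rules: ..."

def review(clause: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SHARED_PREFIX},  # identical on every call
            {"role": "user", "content": clause},           # only the tail varies
        ],
        user="contract-review-worker-1",  # stable identifier for this workload
    )
    return response.choices[0].message.content
```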

Batch processing, however, runs on off-peak resources. Batch requests don’t follow a growing “chat” pattern. There’s also a bunch of them you might have sent, and they might be serviced in parallel when capacity is reached: parallel and simultaneous requests would not find a prebuilt cache, and parallel requests distributed across machines would not land on a unit that already holds cache state. Or some run now and some run hours later, after the cache has expired.

Getting cache efficiency would also take some preprocessing of the whole batch: working out what the calls actually have in common, and which call, beyond a 256-token prefix hash, is the best one for creating the initial cache versus the others that are similar; then sorting and ranking them, running just one and holding the others back until that cache exists. And then you might have sent a bunch of sub-batches on top of that. (A rough sketch of that grouping step follows below.)
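For illustration only, here is what that grouping step could look like client-side, assuming the documented JSONL batch-input format (custom_id / method / url / body); the prefix length and hashing are placeholders for the idea of a “first-N-tokens” key, not OpenAI’s internal scheme:

```python
# Illustrative only: group batch-input lines by a shared prompt prefix so
# requests that could reuse a cache end up together. One request per group
# could then be run first to warm a cache before the rest are dispatched.
import hashlib
import json
from collections import defaultdict

PREFIX_CHARS = 1024  # stand-in for "roughly the first 256 tokens"

def prefix_key(jsonl_line: str) -> str:
    body = json.loads(jsonl_line)["body"]
    first_message = body["messages"][0]["content"]  # assumes a chat-style body
    return hashlib.sha256(first_message[:PREFIX_CHARS].encode()).hexdigest()

def group_by_prefix(jsonl_lines: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for line in jsonl_lines:
        groups[prefix_key(line)].append(line)
    return dict(groups)
```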

So there is some “doable” in there, but there’s also an understandable “not actually implemented”, or “not going to make any promises about you getting any discount, you’re already discounted”.

@_j There are multiple misunderstandings in your reply; I’ll try to clear them up.

There’s no reason to think that OpenAI wouldn’t minimize the computation via technology if the technology facilitated that.

The “technology” already facilitates that for flex processing.

Batch processing, however, runs on off-peak resources.

Yes, just like flex processing.

There’s also a bunch of them you might have sent, and they might be serviced in parallel when capacity is reached: parallel and simultaneous requests would not find a prebuilt cache, and parallel requests distributed across machines would not land on a unit that already holds cache state. Or some run now and some run hours later, after the cache has expired.

This is irrelevant and obvious; yes, parallel requests will have cache misses. For those of us doing things at scale, OpenAI does not process all of our requests at once.

not going to make any promises about you getting any discount

I am not asking for any promises. Prompt caching with OpenAI is never a promise to begin with. Again, this is the same for any type of request, not just flex or batch.