Why no prompt caching for batch jobs?

I don’t understand why flex processing is able to give a discount on cached input tokens, but requests sent as “proper” batches are not eligible for prompt caching.

(With Azure OpenAI, I can actually see prompt caching is “working”; they just aren’t discounting the input tokens :frowning:)
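Here’s roughly how that shows up, as a minimal sketch assuming the standard openai Python SDK and the published usage fields (the model name is just a placeholder; on Azure the same fields should appear via the AzureOpenAI client):

```python
# Minimal check of whether a request reported any cached input tokens.
# Assumes the standard openai Python SDK and a prompt long enough to be
# cacheable; the field path follows the published usage schema.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "A shared prefix well over 1,024 tokens..."}],
)

usage = response.usage
details = usage.prompt_tokens_details
print("prompt tokens:", usage.prompt_tokens)
print("cached prompt tokens:", details.cached_tokens if details else 0)
```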

Lack of a QA team. And also likely because batches make it “too easy” to get cache hits consistently.


There’s no reason to think that OpenAI wouldn’t minimize the computation via technology if the technology facilitated that. That is exactly what context window caching does: it reduces recomputation.

We can look at what has been disclosed about the technology so far to infer why:

Inbound API calls have the start of the input and the user field hashed, so that similar calls are routed to the same server holding a local cache state. That mechanism can handle roughly 15 calls per minute before additional calls are rotated out to another instance for servicing.
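As a rough client-side sketch of playing along with that routing (assuming the standard openai Python SDK, and inferring from the description above, not from any guarantee, that a byte-identical prefix plus a stable user value are what the hash keys on):

```python
# Sketch: keep the cacheable prefix byte-identical and the `user` value
# stable so the routing described above has something consistent to hash.
# The model name and workload identifier are placeholders.
from openai import OpenAI

client = OpenAI()

# A long, byte-for-byte identical prefix (system prompt plus shared context)
# gives the prefix hash something stable to match on.
SHARED_PREFIX = "You are a contract-review assistant. Follow these rules: ..."

def review(clause: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SHARED_PREFIX},  # identical on every call
            {"role": "user", "content": clause},           # only the tail varies
        ],
        user="contract-review-worker-1",  # stable identifier for this workload
    )
    return response.choices[0].message.content
```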

Batch processing, however, runs on off-peak resources. Batch requests don’t follow a growing “chat” pattern. There’s also a bunch of them you might have sent, and they might be serviced in parallel when capacity is reached: parallel and simultaneous requests would not find a prebuilt cache, and parallel requests distributed across machines would not land on a unit that already holds cache state. Or some run now and some run hours later, after the cache has expired.

Getting cache efficiency would also take some preprocessing of the whole batch: working out what the calls actually have in common, and which call, beyond a 256-token prefix hash, is the best one for creating the initial cache versus the others that are similar; then sorting and ranking them, running just one and holding the others back until that cache exists. And then you might have sent a bunch of sub-batches on top of that. (A rough sketch of that grouping step follows below.)
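For illustration only, here is what that grouping step could look like client-side, assuming the documented JSONL batch-input format (custom_id / method / url / body); the prefix length and hashing are placeholders for the idea of a “first-N-tokens” key, not OpenAI’s internal scheme:

```python
# Illustrative only: group batch-input lines by a shared prompt prefix so
# requests that could reuse a cache end up together. One request per group
# could then be run first to warm a cache before the rest are dispatched.
import hashlib
import json
from collections import defaultdict

PREFIX_CHARS = 1024  # stand-in for "roughly the first 256 tokens"

def prefix_key(jsonl_line: str) -> str:
    body = json.loads(jsonl_line)["body"]
    first_message = body["messages"][0]["content"]  # assumes a chat-style body
    return hashlib.sha256(first_message[:PREFIX_CHARS].encode()).hexdigest()

def group_by_prefix(jsonl_lines: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for line in jsonl_lines:
        groups[prefix_key(line)].append(line)
    return dict(groups)
```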

So there is some “doable” in there, but there’s also an understandable “not actually implemented”, or “not going to make any promises about you getting any discount, you’re already discounted”.

@_j There are multiple misunderstandings in your reply; I’ll try to clear them up.

There’s no reason to think that OpenAI wouldn’t minimize the computation via technology if the technology facilitated that.

The “technology” already facilitates that for flex processing.

Batch processing, however, runs on off-peak resources.

Yes, just like flex processing.

There’s also a bunch of them you might have sent, and they might be serviced in parallel when capacity is reached: parallel and simultaneous requests would not find a prebuilt cache, and parallel requests distributed across machines would not land on a unit that already holds cache state. Or some run now and some run hours later, after the cache has expired.

This is irrelevant and obvious; yes, parallel requests will have cache misses. For those of us doing things at scale, OpenAI does not process all of our requests at once.

not going to make any promises about you getting any discount

I am not asking for any promises. Prompt caching with OpenAI is never a promise to begin with. Again, this is the same for any type of request, not just flex or batch.