Let’s say I create a batch where every request has the same system prompt of more than 1,024 tokens. Will prompt caching work the same way it does for normal chat completions?
Hi @thanawarat.jongchiew! Yes it will, and in fact it’s probably the most effective way of utilizing caching, since the requests in a batch are executed back-to-back.
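To illustrate, here’s a minimal sketch of how such a batch input file could be put together so every request shares the same long system prompt (the cacheable prefix). It assumes the standard Batch API JSONL format and the official Python SDK; the model name, file path, and `user_inputs` list are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

# Placeholder: a static system prompt that exceeds the ~1,024-token caching threshold.
LONG_SYSTEM_PROMPT = "You are an extraction assistant. ..."  # >1,024 tokens in practice

# Placeholder inputs to process in the batch.
user_inputs = ["document 1 text ...", "document 2 text ...", "document 3 text ..."]

# Build the batch input JSONL: every request repeats the same system prompt,
# so the shared prefix is what prompt caching can reuse across requests.
with open("batch_input.jsonl", "w") as f:
    for i, text in enumerate(user_inputs):
        line = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder model
                "messages": [
                    {"role": "system", "content": LONG_SYSTEM_PROMPT},
                    {"role": "user", "content": text},
                ],
            },
        }
        f.write(json.dumps(line) + "\n")

# Upload the file and create the batch job.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```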
Does that mean I can reduce cost by 4×? 50% from batching and 50% from caching?
If you look at the pricing page, there is a specific price listed for cached input and a separate input price listed for batch:
https://openai.com/api/pricing/
They are not combined, and there is no “cached” pricing tier under batch.
Even if caching does reduce computation in a batch, there is no indication that those savings would be passed along. Caching also relies on requests being routed to the same API server destination within a time window, which batch processing may not do or have any awareness of.
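To make the arithmetic explicit (with a placeholder base rate, purely for illustration): a 4× reduction would require the two 50% discounts to stack, and that combined tier is not what the pricing page lists.

```python
base = 1.00                  # placeholder $ per 1M input tokens
batch_input = 0.5 * base     # batch input rate (listed)
cached_input = 0.5 * base    # cached input rate (listed)
stacked = 0.5 * 0.5 * base   # hypothetical combined tier = 4x reduction (not listed)
print(batch_input, cached_input, stacked)  # 0.5 0.5 0.25
```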
I wouldn’t put any hard numbers or percentages on the savings, but to give you an idea: in the last batch jobs we ran, almost exactly 50% of our input tokens were cached. Take that number with a grain of salt, but that’s how it turned out in our case.

So yes, where it makes sense, I would opt for doing a batch job. Obviously this doesn’t fit every use case, but if you’re extracting or parsing information from a large amount of data, outputting it in structured/JSON form, and you have a decently large static system prompt, this approach will give you the maximum savings.
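For what it’s worth, you can check this yourself once a batch completes: each line of the output file carries a usage object, and (assuming the current response shape) `usage.prompt_tokens_details.cached_tokens` reports how many input tokens were served from cache. A rough sketch, with the batch ID as a placeholder:

```python
import json
from openai import OpenAI

client = OpenAI()

# Placeholder: the ID of a completed batch job.
batch = client.batches.retrieve("batch_abc123")
output = client.files.content(batch.output_file_id)

total_prompt = 0
total_cached = 0
for line in output.text.splitlines():
    result = json.loads(line)
    usage = result["response"]["body"]["usage"]
    total_prompt += usage["prompt_tokens"]
    # cached_tokens may be absent on some responses; treat missing as 0
    total_cached += usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)

print(f"Cached input tokens: {total_cached}/{total_prompt} "
      f"({100 * total_cached / max(total_prompt, 1):.1f}%)")
```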