The API gives a 50% discount on regular completions and much higher rate limits (250M input tokens enqueued for GPT-4T). Results are guaranteed to come back within 24 hours, and often much sooner.
Copying from our team’s FAQ on this feature for the first half of your question:
“There is no fixed limit on the number of requests you can batch; however, each usage tier has an associated batch rate limit. Your batch rate limit includes the maximum number of input tokens you have enqueued at one time. You can find your rate limits here.”
Then regarding the input file size, I believe it falls into a similar category as above: it isn't strictly tied to file size but to tokens as well, and you can use the file upload API doc as a guide.
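For anyone who wants a concrete starting point, here is a minimal sketch of the two-step flow (upload a JSONL file, then enqueue the batch) using the official Python SDK; the file name and printed fields are placeholders from my own testing:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload the JSONL input file, one request per line.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),  # placeholder file name
    purpose="batch",
)

# Step 2: enqueue the batch; "24h" is currently the only completion window.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)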
I have a question about the upper limit on file size, particularly because vision support means that including images in batch requests could significantly increase file sizes, even when using low detail mode where each image is just 85 tokens.
I noticed the files guide discusses the upper limits for the Assistants API, and max standard storage for orgs.
Do the Assistants API limits apply to the Batch API as well?
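For context on why file size worries me: a single JSONL line embedding a base64 image at low detail looks something like the sketch below (custom_id, file path, and model name are just illustrative), and the data URL is what inflates the file even though the image only costs 85 tokens.

import base64
import json

# Read an image and embed it as a base64 data URL; this is what makes the
# input file large even at low detail. Path and names are placeholders.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

request = {
    "custom_id": "img-request-1",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4-turbo",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                               "detail": "low"}},  # low detail: flat 85 tokens
            ],
        }],
        "max_tokens": 200,
    },
}
print(json.dumps(request))  # one line of the .jsonl input file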
Hi, @jeffsharris! I see many use cases for the Batch API in our pipelines; I'm just missing JSON mode. I couldn't find any relevant information in the documentation or in the FAQ. Although we can always prompt a model for structured output, JSON mode improves reliability. Is it available or coming soon?
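For what it's worth, if the Batch API passes response_format through to chat completions (I haven't confirmed this in the docs, so treat it as an assumption), a JSON-mode batch line would look like:

# Hypothetical batch line requesting JSON mode; assumes response_format is
# honored by the Batch API, which is not confirmed in the documentation.
request = {
    "custom_id": "json-request-1",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4-turbo",
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system",
             "content": "Reply with a JSON object with keys 'name' and 'score'."},
            {"role": "user", "content": "Rate the movie Alien."},
        ],
    },
}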
I see a few use cases here, and would love to get more from the community:
OpenAI wants to reduce its primary server load and move a percentage of use cases into batch mode. I'd assume 90% of use cases require immediate responses, but some production systems do not; for example, writing batch emails to 50,000 users (see the sketch after this list).
A lot of AI tasks, like summarisation, could be moved into this mode (especially if there is a 50% reduction in cost) and it's available within 24 hours.
JSONL might still be a limitation for non-technical users.
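To make the batch-email use case above concrete, here is a minimal sketch of building the JSONL input file; the users list, the prompt, and the model name are all placeholders:

import json

# Build one batch request per user. In practice `users` would be the
# 50,000 rows from your database; two placeholder rows shown here.
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

with open("email_batch.jsonl", "w") as f:
    for user in users:
        line = {
            "custom_id": f"email-{user['id']}",  # unique id to match results later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4-turbo",
                "messages": [
                    {"role": "system",
                     "content": "Write a short, friendly product-update email."},
                    {"role": "user",
                     "content": f"Address the user by name: {user['name']}."},
                ],
            },
        }
        f.write(json.dumps(line) + "\n")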
Considering the default timeframe is 24 hours, keeping a push connection open for half the day may not be that practical. An expectation that you are immediately queued for prompt execution might lead to disappointment.
Or, it is easy to speculate: move anything they can off peak times, and reward those who will wait for idle time with the lower price. You can look at the daily and weekly response times and see exactly when these batches should run; no algorithm needed. Then add in an empty queue of fine-tuning and other servers that might be switched over to serving language models.
Also, that off-peak capacity is likely the reason for a 24-hour window; the best “window” is really when the world sleeps.
Good for that benchmark that you can pick up from your inbox tomorrow.
There are a few things that aren’t clear, and we hope the big brains have considered them:
Tiered rate limits: will everything be managed so batches never go over and/or never affect production rates? Edit: documented to be completely separate; the only limit is the queue size in tokens.
Moderations: you wouldn’t want 1,000 system prompts that are at a triggering level all going through to accumulate against you (i.e., check inputs first; a pre-check sketch follows this list), or to pay for 1,000 inputs that get you no more than a content_filter stop reason…
Account balances: when your balance is emptied halfway through or the hard limit is hit (if that even works any more… several have reported huge overages), what is expected? Termination of the batch, or does it keep running on the remainder? (It seems the latter is likely: you just get a file of errors for what couldn’t be paid at the time the requests ran.)
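Following up on the moderation point, a minimal pre-check sketch using the moderations endpoint before anything gets enqueued; the prompts list is a placeholder:

from openai import OpenAI

client = OpenAI()

# Pre-screen inputs so flagged prompts never make it into the batch file,
# per the "check inputs first" point above. `prompts` is a placeholder.
prompts = ["summarise this article ...", "another prompt ..."]

safe_prompts = []
for text in prompts:
    result = client.moderations.create(input=text)
    if not result.results[0].flagged:
        safe_prompts.append(text)
    # Flagged prompts are dropped here rather than paid for and then
    # stopped with a content_filter finish reason.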
I have two websites running that have a permanent endpoint for incoming pushes. It’s no burden at all.
It dropped my polling from tens of thousands of requests to tens of requests to retrieve results when they are available: a 1000x improvement in efficiency, to the significant benefit of both parties.
(This is especially useful when using APIs that have quotas associated with them.)
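Until push notifications exist for batches, an infrequent polling loop against the batch status is the pragmatic middle ground. A minimal sketch, where the batch id and the 10-minute interval are placeholders:

import time
from openai import OpenAI

client = OpenAI()
batch_id = "batch_abc123"  # placeholder id returned by batches.create

# With a 24-hour window, polling every 10 minutes is more than enough.
while True:
    batch = client.batches.retrieve(batch_id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(600)

if batch.status == "completed" and batch.output_file_id:
    output = client.files.content(batch.output_file_id)
    print(output.text)  # one JSON result per line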
I don’t see why “24 hours” couldn’t be reduced at some stage, at least to a lower average …
when the sun is over the middle of the Pacific Ocean …
On this topic, I wonder if 50% really reflects the full benefit. You are probably talking about a far greater saving in infrastructure … (but in any case I fully support this move!)
I just intended to run an example request. The file upload went well, but upon submission of the batch request I got a 400 error, as follows. The error makes no sense, as the request should cost me no more than ~$2.50, and I am very far from my hard limit based on month-to-date usage.
{
  "error": {
    "message": "Billing hard limit has been reached",
    "type": "invalid_request_error",
    "param": null,
    "code": "billing_hard_limit_reached"
  }
}
Quick edit for anyone running into this problem:
It seems to have been related to the recent switch to Project API Keys. The same batch request went through successfully with a newly created Project API Key.
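In code, the workaround amounts to retrying the same call with a fresh Project API Key. A sketch of what worked for me; the environment variable names, file id, and the string check on the error are my own assumptions:

import os
from openai import OpenAI, BadRequestError

try:
    client = OpenAI(api_key=os.environ["OPENAI_LEGACY_KEY"])  # placeholder var
    batch = client.batches.create(
        input_file_id="file-abc123",  # placeholder file id
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
except BadRequestError as e:
    # Heuristic check: the 400 body carries the billing_hard_limit_reached code.
    if "billing_hard_limit_reached" in str(e):
        client = OpenAI(api_key=os.environ["OPENAI_PROJECT_KEY"])  # new Project API Key
        batch = client.batches.create(
            input_file_id="file-abc123",
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )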
Wondering if the return window will have options other than just 24 hours in the future, like 12 or 16 hours? Being able to say “within 24 hours” is so much better than “24 hours and some change”.