BatchAPI is now available

The BatchAPI is now available!

The API gives a 50% discount on regular completions and much higher rate limits (250M input tokens enqueued for GPT-4T). Results guaranteed to come back with 24hrs and often much sooner.

For more details, visit the docs:

Give it a try and let us know your feedback!


This is super cool! 50% discount as well? Dang. I can see lots of uses for this.

I’m guessing eventually there will be more time frames that affect the discount?


Kudos on the BatchAPI launch!

Is there an upper limit on the number of requests, tokens, or input file size for this API?

I’m super excited to test this out and can’t wait for it to come to the embeddings endpoint as well.



Copying from our team’s FAQ on this feature for the first half of your question:

“There is no fixed limit on the number of requests you can batch; however, each usage tier has an associated batch rate limit. Your batch rate limit includes the maximum number of input tokens you have enqueued at one time. You can find your rate limits here.”

Then regarding the input file size, I believe it falls into a similar category as above. It isn’t strictly tied to file size, but also tokens, and you can use the file upload API doc as a guide.


Does the input JSONL file still have the 2M language token limitation?

1 Like

yep - exactly. we do hope to build out more time windows


Hi @adjokic ,

Thanks for getting back to me!

I have a question about the upper limit on file size, particularly because vision support means that including images in batch requests could significantly increase file sizes, even when using low detail mode where each image is just 85 tokens.

I noticed the files guide discusses the upper limits for the Assistants API, and max standard storage for orgs.
Do the Assistants API limits apply to the Batch API as well?

Thanks for clarifying this!

1 Like

Hi, @jeffsharris! I see many use cases for the BatchAPI in our pipelines. I’m just missing the JSON mode. I couldn’t find any relevant information in the documentation or in the FAQ. Though, we can always prompt a model to get structured outputs, JSON mode improves reliability. Is it available or coming soon?

Welcome @bruno.nvsx

JSON mode is simply the response format set to json on the chat completion. You can do this by passing response_format with value

{ "type": "json_object" }

in the body of the request input object.


Thanks, @sps! I thought it would be the case but couldn’t test it yet. Now, let’s wait for the other endpoints. :nerd_face:

it’s like they read my mind
I’ve been cursing the lack of batch capabilities all weekend

Do we have to poll to check status?

If so is a pubsub option planned (e.g. like WebSub - Wikipedia) so we can get the results JIT?

Polling is so year 2000’s …

That way you can also deal with your callbacks off-peak too!


I see a few use cases here, and would love to get more from the community:

  1. OpenAI wants to reduce its primary server loads, and wants to move a % of use cases into batch mode. I can assume that 90% of use cases require immediate responses, but some production systems do not - for example, writing batch emails to 50,000 users.
  2. A lot of AI tasks like summarisation could be moved into this mode (especially if there is a 50% reduction in cost) and its available in 24 hours
  3. JSONL might still be a limitation for non-tech users
  4. This is perfect for Production systems

Would love to hear your thoughts…

1 Like

Considering the default timeframe is 24 hours, a push connection open for you for half the day may not be that practical. An expectation that you are immediately queued for prompt execution might be disappointment.

Or, it is easy to speculate, move anything they can off of peak times, and reward those who will wait for idle time with the lower price. You can look at the daily and weekly response times and see exactly when these should run, no algorithm needed. Then add an empty queue of fine-tuning and other servers that might switch to language model.

Also, that off-peak is likely the reason for a 24 hour window; the best “window” is really when the world sleeps.

Good for that benchmark that you can pick up from your inbox tomorrow.

There’s a few things that aren’t clear, and we hope the big brains have considered:

  • Tiered rate limits: everything will be managed, never going over and/or never affecting production rates? edit: documented to be completely separate, only limit is queue in tokens.
  • Moderations: You wouldn’t want 1000 system prompts that are at a triggering level all going through to accumulate against you (aka check inputs first), or paying for 1000 inputs that get you no more than content_filter stop reason…
  • Account balances. When you are emptied halfway through or hard limit is hit (if that even works any more…several have reported huge overages), what is expected? Termination of the batch, or it still keeps running on the remainder…(it seems the latter is likely, you just get a file of errors for what couldn’t be paid at the time they ran)

I have two websites running that have a permanent end point for incoming pushes. It’s no burden at all.

It dropped my polling from 10,000’s of requests to 10’s of requests to retrieve results when they are available, a 1000x improvement in efficiency to the significant benefit of both parties.

(This is especially useful when using API’s that have quotas associated with them)

I don’t see why “24 hours” couldn’t be reduced at some stage to a lower average at least …

1 Like

when the sun is on the middle of the Pacific Ocean …

On this topic, I wonder if 50% really reflects the full benefit. You are probably talking about a far greater saving in infrastructure … (but in any case I fully support this move!)


Anyone successfully tested yet?

I just intended to complete an example request. File upload went well but upon submission of the batch request I got a 400 error as follows. The error makes no sense as the request should cost me no more than ~$2.5 and the delta to my hard limit based on usage month to date is very far away.

    "error": {
        "message": "Billing hard limit has been reached",
        "type": "invalid_request_error",
        "param": null,
        "code": "billing_hard_limit_reached"

Quick edit for anyone running into this problem:
It seems to have been related to the recent switch to Project API Keys. The same batch request went through successfully with a newly created Project API Key.

1 Like

Interesting. Thank you for that. I’m running some experiments now, will see what the results are soon.

Wondering if the return window will have other options in the future than just 24 hours? Like 12 or 16 hours? Being able to say “within 24 hours” is so much better than “24 hours and some change” :grin: