Feature Request: Add Token Usage Transparency and Real-Time Release in Batch API

Dear OpenAI Support,

I would like to highlight a significant issue in the management of batch jobs via API that directly affects workflow stability and user trust.

Currently, when using POST /v1/batches, a job may be rejected due to exceeding the 90,000 enqueued token limit (total input + max output tokens).
However:

  • The error message does not provide the actual estimated token count (neither total nor per line).
  • There is no pre-validation tool or endpoint to help estimate token usage beforehand.
  • The token validation logic seems opaque and cannot be replicated by the user.
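
For context, a minimal sketch of where that feedback is missing today, assuming the official Python SDK (the file name is a placeholder, and exactly how the rejection surfaces may differ): after a rejection, nothing returned with the batch tells you how many tokens were counted.

```python
# Minimal sketch (Python SDK assumed): submit a batch and inspect the
# rejection details. Today the error object carries no token counts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL batch input (one request per line).
batch_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch",
)

# Create the batch job.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Check the batch; if the enqueued-token limit is exceeded the job is
# rejected, but the errors contain no estimated token totals to act on.
batch = client.batches.retrieve(batch.id)
print(batch.status)   # e.g. "failed"
print(batch.errors)   # no per-line or total token estimate here
```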

This results in:

  • Workflow disruptions that cannot be debugged easily.
  • Developers being forced to apply overly conservative estimates (e.g., characters / 3) just to avoid rejection.
  • Suboptimal batch construction, with significant underuse of allowed token capacity.

#### A concrete case

In a recent batch:

  • My input file had 4,973 characters.
  • Using a conservative formula, I estimated ~1,657 tokens.
  • But the real token count in prompt_tokens was only 879.
  • That’s 5.65 characters per token, far from the expected 3:1 ratio.

This discrepancy shows that the current lack of feedback leads to substantial inefficiency.
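
The arithmetic, spelled out (Python shown only for illustration; the 879 figure is the prompt_tokens value reported back by the API):

```python
# The numbers from the batch above, spelled out.
chars = 4973              # characters in the input file
estimate = chars // 3     # conservative characters / 3 estimate -> 1657
actual = 879              # prompt_tokens reported by the API

print(estimate - actual)  # 778 tokens of capacity reserved for nothing
print(chars / actual)     # about 5.66 characters per token, nowhere near 3
```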

### Suggested improvements

I respectfully request the following improvements to the batch API:

  1. Add a visible estimated token count during batch submission:
     • Either globally for the batch
     • Or per job line
  2. In case of rejection, return:
     • Estimated total enqueued tokens
     • Line-by-line token estimates
     • A breakdown of which lines caused the overage
  3. Provide an API endpoint to estimate token usage (/v1/token-estimate or similar), usable outside batch mode (a sketch of what such a call could look like follows this list).
  4. Once a batch completes or fails, expose a live value (e.g., currently_enqueued_tokens) to let users know:
     • How many enqueued tokens are still “reserved”
     • When those tokens will be released
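
To make points 3 and 4 concrete, here is a rough sketch of what such a pre-validation call and its response could look like. To be explicit: the /v1/token-estimate path, the field names, and the currently_enqueued_tokens value do not exist today; they are shown only to illustrate the shape of the proposal.

```python
# Hypothetical only: neither this endpoint nor these fields exist today.
# The sketch just illustrates the shape of the proposed pre-validation call.
import os
import requests

API_KEY = os.environ["OPENAI_API_KEY"]

resp = requests.post(
    "https://api.openai.com/v1/token-estimate",    # proposed path, not real
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input_file_id": "file-abc123"},         # placeholder file id
)

# A useful response could look something like this:
# {
#   "estimated_enqueued_tokens": 84210,
#   "per_line": [{"custom_id": "job-1", "estimated_tokens": 912}, "..."],
#   "currently_enqueued_tokens": 61000,
#   "enqueued_token_limit": 90000
# }
print(resp.status_code, resp.text)
```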

### Additional concern: Token release delay is opaque

It is clear that the system keeps track of how many input tokens are currently “enqueued”, since it blocks new batch submissions based on this invisible quota.
The error message even includes an internal “customer code”, suggesting this value is tracked at account level.

However, after a batch completes (i.e. reaches status = completed), there is no way to know when the previously enqueued tokens are released.

It seems that a background process (perhaps scheduled) eventually clears this quota — but the timing is unknown and undocumented. This adds further unpredictability to batch scheduling.

I suggest that the API should:

  • Subtract enqueued tokens as soon as a batch completes
  • Expose the remaining quota via API in real time
  • Or at least provide a reliable release timing
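
Until something like that exists, the only option is to poll the batch status and retry new submissions blindly with a backoff, since the release timing cannot be queried. A rough workaround sketch, assuming the official Python SDK (the polling interval, the backoff values, and the assumption that the rejection surfaces as an API error are all mine):

```python
# Workaround sketch (Python SDK assumed): wait for an existing batch to
# finish, then retry the next submission with a blind backoff, because the
# moment at which enqueued tokens are released cannot be queried.
import time
import openai

client = openai.OpenAI()

def wait_for_batch(batch_id: str, poll_seconds: int = 60) -> str:
    """Poll until the batch leaves the queue one way or another."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in ("completed", "failed", "expired", "cancelled"):
            return batch.status
        time.sleep(poll_seconds)

def submit_with_backoff(input_file_id: str, max_attempts: int = 5):
    """Retry the submission blindly; the freed-up quota is not visible."""
    delay = 120  # arbitrary starting delay, doubled after each rejection
    for _ in range(max_attempts):
        try:
            return client.batches.create(
                input_file_id=input_file_id,
                endpoint="/v1/chat/completions",
                completion_window="24h",
            )
        except openai.APIStatusError:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("batch could not be enqueued after retries")
```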

### Final thoughts

As a developer building an industrial-scale application on top of the OpenAI API, I believe reliability and visibility are essential.
The current behavior — rejecting a batch for token reasons without showing the numbers or letting me know when I can safely retry — is not acceptable in a production environment.

I kindly ask the OpenAI team to consider this issue seriously, and improve the transparency and predictability of batch token management.

Thank you for your attention and the great work you do.
Sergio Bonfiglio

Here is the fault, as seen in the message you quoted.

You can use the tiktoken library to measure language tokens exactly yourself.

You can then count the encoded language tokens across the combined messages, adding the usual overhead of roughly 4 tokens per message and 3 per call.
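
For example, a rough count for chat messages could look like this (assuming the o200k_base encoding used by gpt-4o-class models; check which encoding your target model actually uses, and treat the per-message overhead as an approximation):

```python
# Rough prompt-token count for chat messages, assuming the o200k_base
# encoding (gpt-4o-class models; older models use cl100k_base).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_message_tokens(messages: list[dict]) -> int:
    """Encoded content plus approximate per-message and per-call overhead."""
    total = 0
    for message in messages:
        total += 4  # approximate overhead per message (role, separators)
        for value in message.values():
            total += len(enc.encode(value))
    return total + 3  # approximate overhead per call (reply priming)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Estimate my tokens, please."},
]
print(count_message_tokens(messages))
```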

The other fault is that OpenAI uses an estimator, not a true token encoder, for rate limiting, and it can be off by around 20% in typical use. This is where your improvement suggestion would help: by surfacing the results of those inaccurate calculations, such as the ones reflected in the x-ratelimit headers.
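
For instance, those headers can be read off any normal (non-batch) response; a quick sketch with requests (the header names are those documented for the standard API, and nothing equivalent is exposed for the batch enqueued-token quota):

```python
# Inspect the rate limiter's own token accounting via response headers on a
# standard (non-batch) request; the batch enqueued-token quota has no
# equivalent header today.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    },
)

for name in (
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-tokens",
    "x-ratelimit-reset-tokens",
):
    print(name, resp.headers.get(name))

# Comparing the drop in "remaining tokens" with a tiktoken count of the same
# request shows how far the estimator used for rate limiting is off.
```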

Sorry, but your reply doesn’t make any sense.
First, you presume we’re using Python. We are not: we’re using Delphi, and we do not want to build a hybrid solution spanning two languages (the death of portability and code readability).
Second, I don’t understand why I should reinvent the wheel. If a computation is already performed to “limit” the enqueued tokens, why not pass that data on to the user? This behavior is neither transparent nor professional.
Third, it is ridiculous to calculate the tokens with a methodology different from the server-side one, because it does not solve the problem.
The server-side calculation always wins, so we are “on the road again”: our estimate may collide with the server’s calculation once more, and the workflow breaks again.
This problem must be solved in the way I described in my post: expose the data in a transparent, complete way.
That is professional. The rest is just talk.