Discrepancy: omni-moderation-latest Token Usage vs Tier 1 Rate Limits (Vision)

Hello everyone,

I am implementing an image moderation pipeline using the `omni-moderation-latest` model. I am currently on Tier 1, which lists a limit of 10,000 TPM (tokens per minute) for this model: https://platform.openai.com/docs/models/omni-moderation-latest.

However, I am observing a significant discrepancy between the documented “Vision” token calculation (https://platform.openai.com/docs/guides/images-vision#overview) and the actual Rate Limit enforcement.

The Context:

  • Model: omni-moderation-latest

  • Input: 20 consecutive images.

  • Dimensions: 2500x1667 px per image.

  • Execution Time: ~33 seconds.

The Math (Based on GPT-4o Vision specs): According to the standard Vision formula (High Detail):

  1. Scale to 2048x1365 → Scale shortest side to 768px → 1152x768.

  2. Tiling (512px): 3x2 = 6 tiles.

  3. Cost: (6 * 170) + 85 = 1,105 tokens/image.

  4. Total theoretical load: 20 * 1,105 = 22,100 tokens.
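The calculation above can be reproduced in a few lines of Python. This is a sketch of the published high-detail formula; the exact rounding at the resize steps is my assumption, so results near tile boundaries could differ slightly:

```python
import math

def vision_tokens_high_detail(width: int, height: int) -> int:
    """Estimate high-detail vision tokens per the published GPT-4o formula."""
    # Step 1: scale down to fit within a 2048x2048 square.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = round(width * scale), round(height * scale)
    # Step 2: scale down so the shortest side is 768 px (never upscale).
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = round(width * scale), round(height * scale)
    # Step 3: 170 tokens per 512-px tile, plus an 85-token base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85

per_image = vision_tokens_high_detail(2500, 1667)
print(per_image, 20 * per_image)  # 1105 22100
```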

The Anomaly: My script successfully processed these 22,100 tokens in 33 seconds without receiving any 429 Too Many Requests error. That is roughly 220% of my stated 10k TPM limit, sustained for over half a minute.

My Questions:

  1. Does omni-moderation-latest use a different token counting logic than standard GPT-4o Vision? (e.g., is it treated as “Low Detail” by default regardless of resolution?)

  2. Are Rate Limits for the Moderation endpoint decoupled from the standard TPM quotas displayed in the dashboard?

  3. Do the API response headers include any specific x-ratelimit-used-tokens value for this model? (I couldn’t find consistent values.)

I need to understand if this behavior is a “feature” (generous limits for safety) or a “bug” (delayed throttling), as I cannot rely on undefined limits for a production pipeline.

Thanks for the clarification.

The rate limiter is an edge service: it cannot do deep inspection of images, and it does not see your detail setting on chat models.

You get a fixed rate limit penalty for “high” or “low” regardless of what you actually send.

Here’s me making calls on a tier with a rate limit high enough that each request is essentially independent (full reset in milliseconds), so I can compare my full rate against the per-call deduction reported in the x-ratelimit headers:

detail: low

gpt-4o, Images: 0, Size: N/A, Tokens Estimated: 45, usage: 45, Rate usage: 46
gpt-4o, Images: 1, Size: 400x400, Tokens Estimated: 130, usage: 130, Rate usage: 811
gpt-4o, Images: 1, Size: 1200x400, Tokens Estimated: 130, usage: 130, Rate usage: 811
gpt-4o, Images: 2, Size: 400x400, Tokens Estimated: 215, usage: 215, Rate usage: 1576
gpt-4o, Images: 2, Size: 1200x400, Tokens Estimated: 215, usage: 215, Rate usage: 1576

detail: high

gpt-4o, Images: 0, Size: N/A, Tokens Estimated: 45, usage: 45, Rate usage: 46
gpt-4o, Images: 1, Size: 400x400, Tokens Estimated: 300, usage: 300, Rate usage: 811
gpt-4o, Images: 1, Size: 1200x400, Tokens Estimated: 640, usage: 640, Rate usage: 811
gpt-4o, Images: 2, Size: 400x400, Tokens Estimated: 555, usage: 555, Rate usage: 1576
gpt-4o, Images: 2, Size: 1200x400, Tokens Estimated: 1235, usage: 1235, Rate usage: 1576
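For reference, the “Tokens Estimated” column matches the published vision pricing plus the ~45-token prompt overhead. A sketch that reproduces it, assuming (as holds for these test sizes) that no image needs rescaling because each fits inside 2048x2048 with a short side under 768 px:

```python
import math

PROMPT_OVERHEAD = 45  # the zero-image baseline from the log above

def image_tokens(width: int, height: int, detail: str) -> int:
    if detail == "low":
        return 85  # flat per-image cost, no tiling
    # High detail: these test images fit inside 2048x2048 and have a short
    # side under 768 px, so no rescaling applies; just count 512-px tiles.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85

def request_estimate(sizes, detail):
    return PROMPT_OVERHEAD + sum(image_tokens(w, h, detail) for w, h in sizes)

print(request_estimate([(1200, 400)], "high"))             # 640
print(request_estimate([(400, 400), (400, 400)], "high"))  # 555
print(request_estimate([(1200, 400)], "low"))              # 130
```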

The zero-image rows show the overhead of just the prompt, plus the slight inaccuracy of the estimate.

All the remaining requests, at either detail level, show the rate limiter counting each image as 765 tokens.
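The “Rate usage” column fits a simple linear model; both constants below are inferred from the log above, not from any documentation:

```python
def rate_deduction(n_images: int, prompt_overhead: int = 46) -> int:
    # Flat 765-token pre-deduction per image, independent of image size
    # and of the detail setting (constants inferred from the x-ratelimit
    # observations in this thread).
    return prompt_overhead + 765 * n_images

print([rate_deduction(n) for n in (0, 1, 2)])  # [46, 811, 1576]
```

Incidentally, 765 = 85 + 4 × 170, the high-detail cost of a four-tile image, though that may be a coincidence.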

If you blast async requests in near-parallel at the moderations API endpoint until you do hit a limit, you might find the same estimate is used there. That endpoint doesn’t return rate headers, last I checked.

This is the missing piece of the puzzle! Thank you for the clear distinction between usage (actual compute) and Rate usage (pre-deducted quota). The “dumb rate limiter” theory, with a fixed penalty applied regardless of image size, explains a lot.

However, based on my stress tests, it seems the “Fixed Penalty” for the Moderation endpoint is significantly lower than the ~800 tokens you observe on gpt-4o.

Here is the math based on my Tier 1 limits (10k TPM):

  • Scenario: I sent 50 consecutive High-Res images in ~90 seconds.

  • If Moderation used the same rate as GPT-4o (~800 tokens): I would have hit the rate-limit wall around the 13th request (13 × 800 > 10,000).

  • Reality: All 50 passed successfully (31 RPM).

This suggests that while the mechanism is the same (fixed cost), omni-moderation-latest likely has a much “cheaper” fixed entry cost (estimated around 200-300 tokens/image), effectively behaving like a forced “Low Detail” input.
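The run can be turned into a rough upper bound on the per-image deduction. This is a naive model: it assumes the quota refills continuously over the window and that the run (nearly) saturated it:

```python
# Numbers from the stress test above: 50 images in ~90 s, no 429 seen.
tpm_limit = 10_000        # Tier 1 tokens per minute
images_sent = 50
window_seconds = 90

token_budget = tpm_limit * (window_seconds / 60)   # 15,000 tokens available
max_cost_per_image = token_budget / images_sent

print(max_cost_per_image)  # 300.0 -> consistent with a ~300-token penalty
print(token_budget / 800)  # 18.75 -> a GPT-4o-style penalty allows <19 images
```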

A follow-up question:

You mentioned seeing the x-ratelimit headers on your GPT-4o calls. On the moderations endpoint, I am not receiving these specific headers in the response (it seems opaque). Do you know if there is a specific parameter or header key to look for to expose the “Rate usage” specifically for this endpoint?

Thanks again!

“No rate headers” is the “last I checked” status cited above.

Do you know how to get response headers at all, and are you already seeing them on chat endpoints? The OpenAI SDK does not expose them unless you use .with_raw_response.method() in your call; then you must parse them out of the httpx-shaped response object (in Python).
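A minimal sketch of the pattern: the commented call assumes the openai Python SDK v1.x, while the helper below just filters a header mapping and runs without any API key:

```python
# With the openai v1.x SDK, headers come from the raw-response wrapper, e.g.:
#
#   from openai import OpenAI
#   raw = OpenAI().moderations.with_raw_response.create(
#       model="omni-moderation-latest", input="test",
#   )
#   headers = dict(raw.headers)  # httpx-style headers
#   result = raw.parse()         # the usual parsed Moderation object

def ratelimit_headers(headers: dict) -> dict:
    """Filter the x-ratelimit-* keys out of a response-header mapping."""
    return {k: v for k, v in headers.items()
            if k.lower().startswith("x-ratelimit")}

# A chat completion would yield several entries; the moderations and images
# endpoints yield none, matching the console dump in this thread:
sample = {
    "openai-version": "2020-10-01",
    "openai-processing-ms": "31410",
    "x-content-type-options": "nosniff",
}
print(ratelimit_headers(sample))  # {}
```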

I just happen to have a console log open: the images API and its “x-” headers … also no x-ratelimit there (a remaining-image count would be needed, despite the gpt-image models also having a TPM limit corresponding to 32k per image):

"_http_headers": {
  "openai-version": "2020-10-01",
  "openai-organization": "My_Org",
  "openai-project": "proj_1234",
  "x-request-id": "req_456789",
  "openai-processing-ms": "31410",
  "x-envoy-upstream-service-time": "30587",
  "x-content-type-options": "nosniff"
}

This demonstrates that rate headers are not universally provided.

Confirmed. I just ran tests using .with_raw_response on both text-only and image payloads. As you predicted: Zero rate-limit headers present. It is completely opaque.

However, the processing time reveals the hidden cost:

  • Text Request: openai-processing-ms: 82 (~0.1s)

  • Image Request: openai-processing-ms: 1316 (~1.3s)

This 16x latency jump confirms that the Vision encoder is computationally expensive, which justifies the “Fixed Token Penalty” on the rate limit quota. Based on my throughput tests, this penalty seems to be around ~300 tokens/image (significantly cheaper than the ~800 tokens for GPT-4o, but heavy enough to limit concurrency).

We cannot monitor the quota in real-time via headers, but empirical evidence confirms a fixed “Vision Tax” per image.

Thanks for your help debugging this!
