Hello everyone,
I am implementing an image moderation pipeline using the `omni-moderation-latest` model. I am currently on Tier 1, which states a limit of 10,000 TPM (tokens per minute); see https://platform.openai.com/docs/models/omni-moderation-latest.
However, I am observing a significant discrepancy between the documented “Vision” token calculation (https://platform.openai.com/docs/guides/images-vision#overview) and the actual Rate Limit enforcement.
The Context:
- Model: `omni-moderation-latest`
- Input: 20 consecutive images
- Dimensions: 2500x1667 px per image
- Execution time: ~33 seconds
The Math (Based on GPT-4o Vision specs): According to the standard Vision formula (High Detail):
- Scale to fit 2048x2048 → 2048x1365 → scale shortest side to 768 px → 1152x768.
- Tiling (512 px): 3x2 = 6 tiles.
- Cost: (6 × 170) + 85 = 1,105 tokens/image.
- Total theoretical load: 20 × 1,105 = 22,100 tokens.
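For reference, here is how I am estimating the per-image cost. This is a sketch of the documented GPT-4o high-detail formula; whether `omni-moderation-latest` actually uses this formula is exactly what I am asking:

```python
import math

def high_detail_tokens(width: int, height: int,
                       tile_tokens: int = 170, base_tokens: int = 85) -> int:
    """Estimate high-detail vision token cost per the documented GPT-4o formula.
    Assumption: the moderation model may count tokens differently."""
    # Step 1: fit within a 2048x2048 square, preserving aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Step 2: rescale so the shortest side is exactly 768 px.
    if width <= height:
        width, height = 768, int(height * 768 / width)
    else:
        height, width = 768, int(width * 768 / height)
    # Step 3: count 512 px tiles, then apply per-tile and base costs.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * tile_tokens + base_tokens

per_image = high_detail_tokens(2500, 1667)  # 1105
total = 20 * per_image                      # 22100
```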
The Anomaly: My script processed all 22,100 tokens in 33 seconds without a single 429 Too Many Requests error. Extrapolated, that is an effective rate of roughly 40,000 TPM, about 4x my stated 10k TPM limit, sustained for over half a minute.
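The arithmetic behind that claim, extrapolating the observed burst to a per-minute rate:

```python
tokens = 20 * 1105           # 22,100 tokens total, per the math above
elapsed_s = 33               # observed wall-clock time for the batch
effective_tpm = tokens / elapsed_s * 60
print(round(effective_tpm))  # ~40,182 TPM vs. a 10,000 TPM limit
```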
My Questions:
- Does omni-moderation-latest use different token-counting logic than standard GPT-4o Vision? (e.g., is it treated as "Low Detail" by default, regardless of resolution?)
- Are rate limits for the Moderation endpoint decoupled from the standard TPM quotas displayed in the dashboard?
- Do the API response headers include a specific x-ratelimit-used-tokens value for this model? (I couldn't find consistent values.)
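On the last question, this is how I have been inspecting headers. The documented names are `x-ratelimit-limit-tokens`, `x-ratelimit-remaining-tokens`, and `x-ratelimit-reset-tokens` (I have not found an `x-ratelimit-used-tokens`); whether the moderation endpoint returns them consistently is the open question. A sketch using a small helper plus the Python SDK's `with_raw_response` wrapper:

```python
def ratelimit_headers(headers: dict) -> dict:
    """Extract all x-ratelimit-* headers (case-insensitive) for logging."""
    return {k.lower(): v for k, v in headers.items()
            if k.lower().startswith("x-ratelimit-")}

# With the official openai SDK, raw headers are exposed like this:
# resp = client.moderations.with_raw_response.create(
#     model="omni-moderation-latest", input=[...])
# print(ratelimit_headers(dict(resp.headers)))

# Synthetic example using the documented header names:
sample = {
    "x-ratelimit-limit-tokens": "10000",
    "x-ratelimit-remaining-tokens": "9500",
    "x-ratelimit-reset-tokens": "85ms",
    "content-type": "application/json",
}
print(ratelimit_headers(sample))
```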
I need to understand if this behavior is a “feature” (generous limits for safety) or a “bug” (delayed throttling), as I cannot rely on undefined limits for a production pipeline.
Thanks for the clarification.