Clarification about max_completion_tokens rate-limiting

Based on the documentation, the older parameter max_tokens is used to determine rate limiting. With the o1 release, this parameter is deprecated and max_completion_tokens was introduced to account for the “hidden” reasoning tokens used by o1. Will this new parameter contribute to the total rate limit in the same way?
https://platform.openai.com/docs/guides/rate-limits/error-mitigation

Reduce the max_tokens to match the size of your completions
Your rate limit is calculated as the maximum of max_tokens and the estimated number of tokens based on the character count of your request. Try to set the max_tokens value as close to your expected response size as possible.
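For concreteness, here is roughly the kind of call I have in mind (a minimal sketch with the Python SDK; the model name and token values are just placeholders, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()

# Per the docs quoted above, keep max_tokens close to the response size you
# actually expect, since the maximum of max_tokens and the estimated prompt
# size counts toward the rate-limit budget for the request.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize this in one sentence: ..."}],
    max_tokens=60,        # roughly the size of the completion I expect back
)
print(response.choices[0].message.content)
```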


Hi @davidz, Welcome to the forum!

Yes, the rate limit is dictated by the total number of generated tokens, including the reasoning tokens in the o1-series of models.
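If it helps, the response usage object breaks those reasoning tokens out, so you can see what is actually being counted. A rough sketch, assuming a recent Python SDK that exposes completion_tokens_details (field names may differ on older versions):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-mini",             # placeholder o1-series model
    messages=[{"role": "user", "content": "How many primes are there below 50?"}],
    max_completion_tokens=2000,  # must cover reasoning plus the visible answer
)

usage = response.usage
details = getattr(usage, "completion_tokens_details", None)
reasoning = getattr(details, "reasoning_tokens", None) if details else None

# completion_tokens already includes the hidden reasoning tokens,
# which is why they count toward your token rate limits too.
print("prompt tokens:", usage.prompt_tokens)
print("completion tokens (incl. reasoning):", usage.completion_tokens)
print("reasoning tokens only:", reasoning)
```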


max_tokens sets the maximum response length of the API model call. I wouldn’t really call it a rate limit, but rather a safety mechanism.

Were the AI to go off the rails and loop its output forever, this parameter would truncate the output at the length you set, below whatever remains of the model’s context length (or below the maximum response length of newer models). You get a response cut off mid-sentence. The AI doesn’t receive this parameter and behave differently; the parameter just turns off generation when the limit is reached.
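You can tell when this happened by looking at the finish_reason on the choice. A quick sketch (Python SDK, placeholder values):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Write a long story about a robot."}],
    max_tokens=50,        # deliberately small to demonstrate the cut-off
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # Generation was switched off at max_tokens, mid-sentence if need be.
    print("Truncated output:", choice.message.content)
else:
    print("Completed normally:", choice.message.content)
```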

The new o1 parameter provides a similar safety cut-off. It also counts against the unseen tokens used for reasoning, so the difference is that an o1 model’s run might be terminated before you receive any visible output tokens at all.

The reasoning phase can use a huge quantity of tokens; you might be billed for 10x as many tokens as are finally produced in the visible output. So, as a safety limit, you must set this not just higher than any response you expect, but much higher than the model might ever reasonably consume, otherwise you will pay for a partial output you never receive, or one that is similarly truncated.
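As an illustration only (my own sketch, not an official recommendation; the model name and numbers are placeholders), you would set the cap well above any plausible consumption and then check whether the run was cut off before producing visible text:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-mini",             # placeholder o1-series model
    messages=[{"role": "user", "content": "Plan a week-long trip itinerary."}],
    max_completion_tokens=8000,  # far above the visible answer you expect,
                                 # because hidden reasoning tokens also count
)

choice = response.choices[0]
if choice.finish_reason == "length" and not choice.message.content:
    # The budget was spent on reasoning before any visible output appeared:
    # you are billed for those tokens but receive nothing usable.
    print("Ran out of completion budget during reasoning.")
else:
    print(choice.message.content)
```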


“Rate limits” as a term is more applicable to your overall use of the API. If you reach a rate limit, such as your account’s tokens-per-minute limit for a model, the entire request is simply rejected before anything is performed. You can lower the limits even further within a project, so that API keys and models can’t be used any faster than you would realistically use them yourself, offering a slice of safety by slowing down abuse of leaked keys.
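And if you do hit one of those limits, the usual mitigation (as in the error-mitigation guide linked above) is to back off and retry. A minimal sketch, assuming the official Python SDK and its RateLimitError:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def create_with_backoff(max_retries=5, **kwargs):
    """Retry a chat completion with exponential backoff when rate limited."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            # The request was rejected before anything was performed;
            # wait and try again with an increasing delay.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("Still rate limited after retries")

response = create_with_backoff(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=20,
)
```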

I might be misunderstanding the documentation, but the way I read “Your rate limit is calculated as the maximum of max_tokens and the estimated number of tokens based on the character count of your request.” suggests otherwise.
As an example, suppose I have a rate limit of 1000 tokens per {some unit}, and I have requests that only return 3 tokens. If I send 10 requests with max_tokens set to 100, they would only output 3 * 10 = 30 tokens, but 100 * 10 = 1000 tokens would be counted towards the rate limit. The next request would return a rate-limiting error despite only 30 tokens being returned across all requests (hand-waving input tokens for this example).
Maybe someone can correct me there if I’m misreading.
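Restating my hypothetical numbers as a sketch (these are made-up values, not measured behavior):

```python
# Hypothetical numbers from the example above, not real account limits.
token_limit = 1000   # tokens per {some unit}
max_tokens = 100     # requested cap per call
actual_output = 3    # tokens each request really produces

requests = 10
counted_against_limit = max_tokens * requests   # 100 * 10 = 1000
actually_generated = actual_output * requests   # 3 * 10 = 30

# My reading: the next request would be rejected because the counted total
# has already reached the limit, even though only 30 tokens were generated.
print(counted_against_limit >= token_limit)  # True
```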

The question extends to max_completion_tokens as well; I just want to confirm the behavior is the same there.


That’s talking about your consumption of the existing per-minute rate limit, which is based on the total tokens consumed, both input and output, and is actually just a close estimate in practice.
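You can watch this in practice by reading the rate-limit headers on the raw response. A sketch using the Python SDK’s with_raw_response helper; the x-ratelimit-* header names are what the API currently returns, so treat them as an assumption if your account behaves differently:

```python
from openai import OpenAI

client = OpenAI()

raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Hi"}],
    max_tokens=100,
)

# Remaining budget as the server sees it for the current evaluation window.
print("limit:", raw.headers.get("x-ratelimit-limit-tokens"))
print("remaining:", raw.headers.get("x-ratelimit-remaining-tokens"))
print("resets in:", raw.headers.get("x-ratelimit-reset-tokens"))

response = raw.parse()
print("actually used:", response.usage.total_tokens)
```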

The reason it is stated that way is that max_tokens is used to estimate how much your current request will consume, before anything is generated. That estimate is only used to block a request that would go over; the actual consumption after the request finishes is what affects what remains for the evaluation period.
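A rough way to picture that pre-check (my own approximation of the documented rule, using the usual ~4 characters per token rule of thumb):

```python
def estimated_request_cost(prompt: str, max_tokens: int) -> int:
    """Approximate how many tokens a request is assumed to consume before
    anything is generated: the larger of max_tokens and a character-count
    based estimate of the prompt, per the quoted documentation."""
    estimated_prompt_tokens = len(prompt) // 4  # rough chars-per-token heuristic
    return max(max_tokens, estimated_prompt_tokens)

# A request is blocked up front if this estimate exceeds what is left of the
# per-minute budget; what actually gets deducted afterwards is the real token
# count from the finished request.
print(estimated_request_cost("Summarize the following article ...", max_tokens=100))
```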

The effect of the max_tokens parameter = shutting off the generative output
