How does rate limiting account for previous requests?

dstekol · March 28, 2024, 2:53pm

Here is my understanding of how rate limiting works:
If my rate limit is, for instance, 1000 tokens/min and I have already used 900 within the past minute, then a new request for 101 tokens gets rejected. This happens even if the actual completion would have been shorter, since the rate limiter assumes the worst case (i.e. that the output will be exactly max_length tokens).

Here’s what is not clear to me:
Suppose my first request asked for 900 tokens but the completion was only 100 tokens long. Would a second request for 101 tokens get rate-limited (i.e. the rate limiter only counts requested tokens, a.k.a. max_length, for previous requests, regardless of actual completion length)? Or would it go through (i.e. the rate limiter disregards the max_length of the previous request because it knows I only used 100 actual tokens, so another 101 would put me at just 201)?
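
To make the two possibilities concrete, here is a toy sketch of both accounting policies. It is not the actual rate limiter: the class `TokenWindow`, the flag `count_actual`, and the helper `second_request_allowed` are hypothetical names I made up for illustration, and the sketch assumes both requests fall inside the same one-minute window, so no expiry logic is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class TokenWindow:
    """Toy model of a per-minute token budget (hypothetical, not the real limiter)."""
    limit: int          # tokens allowed within the window
    count_actual: bool  # True: charge past requests by tokens actually returned
    charged: list = field(default_factory=list)

    def used(self) -> int:
        return sum(self.charged)

    def try_request(self, max_tokens: int, actual_tokens: int) -> bool:
        # A new request is judged by its worst case (max_tokens),
        # because the completion length is not known up front.
        if self.used() + max_tokens > self.limit:
            return False
        # The two policies differ only in what is recorded afterwards.
        self.charged.append(actual_tokens if self.count_actual else max_tokens)
        return True


def second_request_allowed(count_actual: bool) -> bool:
    window = TokenWindow(limit=1000, count_actual=count_actual)
    # First request: asked for 900 tokens, completion was only 100.
    window.try_request(max_tokens=900, actual_tokens=100)
    # Second request: asks for 101 tokens.
    return window.try_request(max_tokens=101, actual_tokens=101)


print("charge requested tokens:", second_request_allowed(count_actual=False))
print("charge actual tokens:   ", second_request_allowed(count_actual=True))
```

Under the first policy the second request sees 900 + 101 = 1001 > 1000 and is rejected; under the second it sees 100 + 101 = 201 and goes through. My question is which of these two behaviors matches the real limiter.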
