Here is my understanding of how rate limiting works:
If my rate limit is, for instance, 1000 tokens/min and I have already used 900 within the past minute, then a new request with a max_length of 101 gets rejected. This happens even if the actual completion would have been shorter, because the rate limiter assumes the worst case (i.e. that the output will be exactly max_length tokens).
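In code, my understanding of that check is roughly the following. This is only a sketch of my mental model, not the provider's actual implementation; the function name `would_reject` and its parameters are made up for illustration, and I am assuming that landing exactly on the limit is still allowed.

```python
def would_reject(limit_per_min: int, used_in_window: int, max_length: int) -> bool:
    """Return True if a new request would be rejected under my assumed model.

    Assumption: the limiter reserves the full max_length up front, because
    it cannot know the real completion length before generating.
    """
    return used_in_window + max_length > limit_per_min


print(would_reject(1000, 900, 101))  # True: 900 + 101 = 1001 > 1000
print(would_reject(1000, 900, 100))  # False: 900 + 100 = 1000, not over the limit
```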
Here’s what is not clear to me:
Suppose in my first request I asked for 900 tokens but only got 100. Would a second request for 101 tokens get rate-limited (i.e. the rate limiter counts only the requested tokens, a.k.a. max_length, of previous requests, regardless of actual completion length)? Or would it go through (i.e. the rate limiter disregards the max_length of the previous request because it knows I only used 100 actual tokens, so another 101 would put me at only 201)?
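To make the two possibilities concrete, here they are as two ways of charging a finished request against the window. Both are hypothetical models written only to illustrate the question (the names `charged_tokens`, `second_request_allowed`, and `by_requested` are mine); I don't know which one, if either, matches the real limiter.

```python
LIMIT = 1000  # tokens per minute, taken from the example in this post


def charged_tokens(requested: int, actual: int, by_requested: bool) -> int:
    """Tokens a finished request counts for in the current window (hypothetical)."""
    return requested if by_requested else actual


def second_request_allowed(by_requested: bool) -> bool:
    # First request: asked for 900 tokens, received only 100.
    used = charged_tokens(requested=900, actual=100, by_requested=by_requested)
    # Second request: max_length of 101, checked against its worst case.
    return used + 101 <= LIMIT


print(second_request_allowed(by_requested=True))   # False: 900 + 101 = 1001
print(second_request_allowed(by_requested=False))  # True: 100 + 101 = 201
```

The only difference between the two outcomes is whether the first request is remembered as 900 (requested) or 100 (actually generated).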