I was looking for some clarification on how the rate limits for tokens are calculated.
For usage, when I make a request with an input prompt and a completion is generated, the token counts of both the input prompt and the completion are calculated.
For rate limiting, the documentation states that:
Your rate limit is calculated as the maximum of max_tokens and the estimated number of tokens based on the character count of your request
If I understand this correctly, this means that before a request is processed, the service checks whether max(input_prompt_token_length, max_tokens) would take you over the rate limit and, if so, rejects the request.
- Is this understanding correct?
- Once the request has completed, is the total token length (input prompt plus output completion) used in the rate limit calculation for the next request, or does it still use the estimated number of tokens calculated previously?
For example, if I request a completion where my input prompt is 100 tokens and my max_tokens param is 200 tokens, then for rate limiting purposes it will check whether 200 tokens would go over the rate limit and reject the request if it would.
If the request is valid and serviced, will the next request assume I have used 200 tokens or 300 tokens from my rate limit?
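
To make the example concrete, here is a rough Python sketch of the check as I understand it. This is only my reading of the documentation, not the actual implementation: the function names, the running used_tokens counter, and the 1000-token limit are all hypothetical.

```python
def charged_tokens(estimated_prompt_tokens: int, max_tokens: int) -> int:
    """Tokens counted against the limit before the request runs (my assumption)."""
    # My reading of the docs: the larger of the two values is used, not their sum.
    return max(estimated_prompt_tokens, max_tokens)


def would_be_rejected(used_tokens: int, limit: int,
                      estimated_prompt_tokens: int, max_tokens: int) -> bool:
    """True if the pre-check would push usage past the limit (hypothetical logic)."""
    return used_tokens + charged_tokens(estimated_prompt_tokens, max_tokens) > limit


# The example from my question: 100-token prompt, max_tokens of 200.
pre_check = charged_tokens(100, 200)   # what I think is checked up front
worst_case_actual = 100 + 200          # prompt plus a completion that hits max_tokens

print("pre-check amount:", pre_check)
print("actual usage could be up to:", worst_case_actual)
print("rejected with 900 of 1000 used:", would_be_rejected(900, 1000, 100, 200))
```

My question is which of these two numbers, pre_check or the actual total, is carried over into the calculation for the next request.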