Bug? Approaching token limit, but still getting a 200 response

When I am close to my token limit in my API requests (for example, x-ratelimit-remaining-tokens: 253) and then make another API request with 1000 tokens in the prompt, the API still returns a 200 response. Why? I was expecting an error I could handle, instead of having to worry that I may get an unreliable response without warning.

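# completions_with_backoff and log_info are helpers defined elsewhere (not shown)
for i in range(100):
    log_info(f"{i} starts")
    response = completions_with_backoff(
        messages=[{"role": "user", "content": "Say this is a test" * 2000}],
    )
    log_info(f"{i} ends")
    # Log only the rate-limit headers of each response
    for k in response.headers:
        if "ratelimit" in k:
            log_info(f"{k}: {response.headers.get(k)}")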



2024-04-10 00:21:52,636 - 5 starts
2024-04-10 00:21:54,340 - 5 ends
2024-04-10 00:21:54,342 - b'{\n "id": "chatcmpl-9C8WH114MCq3Vv1dpfQmMhhBgx0uH",\n "object": "chat.completion",\n "created": 1712679713,\n "model": "gpt-3.5-turbo-0125",\n "choices": [\n {\n "index": 0,\n "message": {\n "role": "assistant",\n "content": "This is a test."\n },\n "logprobs": null,\n "finish_reason": "stop"\n }\n ],\n "usage": {\n "prompt_tokens": 10007,\n "completion_tokens": 5,\n "total_tokens": 10012\n },\n "system_fingerprint": "fp_b28b39ffa8"\n}\n'
2024-04-10 00:21:54,343 - x-ratelimit-limit-requests: 10000
2024-04-10 00:21:54,344 - x-ratelimit-limit-tokens: 60000
2024-04-10 00:21:54,345 - x-ratelimit-remaining-requests: 9994
2024-04-10 00:21:54,346 - x-ratelimit-remaining-tokens: 4787
2024-04-10 00:21:54,346 - x-ratelimit-reset-requests: 46.001s
2024-04-10 00:21:54,347 - x-ratelimit-reset-tokens: 55.212s
2024-04-10 00:21:54,348 - ==============================================
2024-04-10 00:21:54,348 - 6 starts
2024-04-10 00:21:59,509 - 6 ends
2024-04-10 00:21:59,511 - b'{\n "id": "chatcmpl-9C8WM76pU8Qw5yEsLHgy78rBoQEzQ",\n "object": "chat.completion",\n "created": 1712679718,\n "model": "gpt-3.5-turbo-0125",\n "choices": [\n {\n "index": 0,\n "message": {\n "role": "assistant",\n "content": "This is a test"\n },\n "logprobs": null,\n "finish_reason": "stop"\n }\n ],\n "usage": {\n "prompt_tokens": 10007,\n "completion_tokens": 4,\n "total_tokens": 10011\n },\n "system_fingerprint": "fp_b28b39ffa8"\n}\n'
2024-04-10 00:21:59,512 - x-ratelimit-limit-requests: 10000
2024-04-10 00:21:59,513 - x-ratelimit-limit-tokens: 60000
2024-04-10 00:21:59,514 - x-ratelimit-remaining-requests: 9993
2024-04-10 00:21:59,514 - x-ratelimit-remaining-tokens: 273
2024-04-10 00:21:59,515 - x-ratelimit-reset-requests: 57.783s
2024-04-10 00:21:59,516 - x-ratelimit-reset-tokens: 59.726s
2024-04-10 00:21:59,516 - ==============================================
2024-04-10 00:21:59,517 - 7 starts
2024-04-10 00:22:09,082 - 7 ends
2024-04-10 00:22:09,085 - b'{\n "id": "chatcmpl-9C8WWiVhBzteWovKLD8K7dvMlMwfK",\n "object": "chat.completion",\n "created": 1712679728,\n "model": "gpt-3.5-turbo-0125",\n "choices": [\n {\n "index": 0,\n "message": {\n "role": "assistant",\n "content": "This is a test."\n },\n "logprobs": null,\n "finish_reason": "stop"\n }\n ],\n "usage": {\n "prompt_tokens": 10007,\n "completion_tokens": 5,\n "total_tokens": 10012\n },\n "system_fingerprint": "fp_b28b39ffa8"\n}\n'
2024-04-10 00:22:09,085 - x-ratelimit-limit-requests: 10000
2024-04-10 00:22:09,086 - x-ratelimit-limit-tokens: 60000
2024-04-10 00:22:09,087 - x-ratelimit-remaining-requests: 9992
2024-04-10 00:22:09,087 - x-ratelimit-remaining-tokens: 312
2024-04-10 00:22:09,088 - x-ratelimit-reset-requests: 1m5.013s
2024-04-10 00:22:09,089 - x-ratelimit-reset-tokens: 59.687s
2024-04-10 00:22:09,090 - ==============================================
2024-04-10 00:22:09,090 - 8 starts
2024-04-10 00:22:18,679 - 8 ends
2024-04-10 00:22:18,682 - b'{\n "id": "chatcmpl-9C8WfoShWYeCKNjYgcJQV2GBlK2cA",\n "object": "chat.completion",\n "created": 1712679737,\n "model": "gpt-3.5-turbo-0125",\n "choices": [\n {\n "index": 0,\n "message": {\n "role": "assistant",\n "content": "Say this is a test"\n },\n "logprobs": null,\n "finish_reason": "stop"\n }\n ],\n "usage": {\n "prompt_tokens": 10007,\n "completion_tokens": 5,\n "total_tokens": 10012\n },\n "system_fingerprint": "fp_b28b39ffa8"\n}\n'
2024-04-10 00:22:18,683 - x-ratelimit-limit-requests: 10000
2024-04-10 00:22:18,683 - x-ratelimit-limit-tokens: 60000
2024-04-10 00:22:18,684 - x-ratelimit-remaining-requests: 9991
2024-04-10 00:22:18,685 - x-ratelimit-remaining-tokens: 295
2024-04-10 00:22:18,686 - x-ratelimit-reset-requests: 1m13.309s
2024-04-10 00:22:18,686 - x-ratelimit-reset-tokens: 59.704s
2024-04-10 00:22:18,687 - ==============================================
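
To compare the reset headers across calls, it helps to turn values such as 46.001s and 1m5.013s into plain seconds. The helper below is a minimal sketch: parse_reset is a hypothetical name, and the accepted format is inferred only from the values in the logs above. The h and ms units are an assumption, since they do not appear in these logs.

```python
import re

# Assumed format, inferred from the logged values: one or more
# <number><unit> parts, e.g. "46.001s" or "1m5.013s".
# The "h" and "ms" units are an assumption; they are not in the logs.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_reset(value: str) -> float:
    """Convert a reset header value such as '1m5.013s' to seconds."""
    parts = _DURATION_PART.findall(value)
    # Reject anything that is not made up entirely of recognized parts.
    if not parts or "".join(num + unit for num, unit in parts) != value:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)
```

Applied to the logs, the x-ratelimit-reset-requests values 46.001s, 57.783s, 1m5.013s and 1m13.309s become directly comparable numbers, which makes the steady increase easier to see.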

The rate limit is not strictly “per minute” over a discrete minute, but rather continuous. You can picture the algorithm as a faucet of tokens that steadily refills the pool you draw from. There are also aspects of penalty or burstiness reflected in the rates that are hard to quantify; for example, you can see that the time to revert to a memory-less state has increased even though the remaining tokens are higher than in the first response.
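
The faucet picture can be sketched as a continuously refilling token bucket. This is only a model of the idea, not OpenAI's actual implementation. TokenBucket and its method names are made up for illustration, and the penalty or burstiness effects mentioned above are not modeled.

```python
class TokenBucket:
    """Continuously refilling bucket; an illustrative model only."""

    def __init__(self, capacity: float, per_minute: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = per_minute / 60.0  # tokens restored per second
        self.tokens = capacity         # assumption: the bucket starts full
        self.updated = now

    def _refill(self, now: float) -> None:
        # Add the tokens that trickled in since the last update, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_consume(self, amount: float, now: float) -> bool:
        """Take `amount` tokens if available; otherwise leave the bucket unchanged."""
        self._refill(now)
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False

    def seconds_until_full(self, now: float) -> float:
        """Time for the bucket to return to its full, memory-less state."""
        self._refill(now)
        return (self.capacity - self.tokens) / self.rate
```

Under this model, with a 60000 token limit and 273 tokens remaining, the time to refill is (60000 - 273) / 1000, about 59.7 seconds, which is close to the x-ratelimit-reset-tokens value of 59.726s logged for call 6.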

Also, there’s a constant interplay between delayed responses counting against you and the estimated request size that might block you.

Thank you for your response. But in my case, I let the loop run for a long time, sending about 10k tokens every few seconds. It’s hard to imagine that I never got an error.

Again, I let it run 30 times in a row. I got no error, but I noticed that the time the API takes to return looks like this:

1st call: 2s
2nd call: 1s
3rd call: 1s
4th call: 2s
5th call: 1s
6th call: 1s
7th call: 2s
8th call: 9s
9th call: 10s
10th call: 10s
11th call: 10s
12th call: 10s
13th call: 9s
14th call: 10s
15th call: 9s
16th call: 10s
17th call: 9s
18th call: 10s
19th call: 9s
20th call: 10s
21st call: 10s
22nd call: 10s
23rd call: 9s
24th call: 10s
25th call: 10s
26th call: 10s
27th call: 10s
28th call: 9s
29th call: 10s
30th call: 9s

Does this mean that, instead of returning errors, the API delays your request until you have enough token quota? That would be an interesting solution from OpenAI…

So my limit is 60k tokens per minute, and every 10 seconds I get a 10k token quota refill, which is just enough to process one request. The numbers add up!
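
That arithmetic can be checked with a small simulation. It is a sketch built on the hypothesis from this thread that the server holds a request until enough quota has refilled, which is not confirmed behaviour. The function name, the full bucket at the start, and the 1.5 second service time (a rough guess based on the 1 to 2 second timings above) are all assumptions.

```python
def simulate_latencies(calls, limit_per_minute=60_000, cost=10_012, service_s=1.5):
    """Latency of back-to-back sequential calls under a 'hold until quota' model."""
    rate = limit_per_minute / 60      # tokens refilled per second
    tokens = float(limit_per_minute)  # assumption: the bucket starts full
    latencies = []
    for _ in range(calls):
        # Hypothetical server behaviour: hold the request until quota suffices.
        wait = max(0.0, (cost - tokens) / rate)
        tokens = tokens + wait * rate - cost
        # Quota keeps refilling while the request is being served.
        tokens = min(limit_per_minute, tokens + service_s * rate)
        latencies.append(wait + service_s)
    return latencies


if __name__ == "__main__":
    for i, seconds in enumerate(simulate_latencies(30), start=1):
        print(f"call {i}: {seconds:.1f}s")
```

In this model the first few calls return after the service time alone, and once the initial reserve is used up each call settles at cost / rate, which is 10012 / 1000, or roughly 10 seconds. That has the same shape as the timings listed above.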

The reset time being just under a minute can be interpreted as you sending at close to, but not exceeding, the limit. If you are sending continuously at exactly the rate limit, and you didn’t start your parallel session with a burst, you should always have a bucket that refills at the same rate you empty it and keeps a continuous reserve of nearly its full size.

You’ll have to be dispatching many calls at once to reach the limit: async or threaded, not limited by your own software.
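
As a rough illustration of dispatching many calls at once, here is a thread-pool sketch. fake_request is a hypothetical stand-in that only sleeps; it would need to be replaced with a real API call, and the worker count is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i: int) -> int:
    """Hypothetical stand-in for one API call; replace with a real request."""
    time.sleep(0.05)  # pretend network and inference latency
    return i


def dispatch(n_calls: int, workers: int) -> tuple[list[int], float]:
    """Send n_calls requests through a pool of threads and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in submission order.
        results = list(pool.map(fake_request, range(n_calls)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = dispatch(n_calls=20, workers=20)
    print(f"{len(results)} calls finished in {elapsed:.2f}s")
```

A sequential loop like the one in the original post can never have more than one request in flight, so its send rate is capped by each call's latency. With a pool, the requests overlap and the token limit becomes the constraint instead.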

Independent calls being throttled, and inference performing worse based on immediate usage history, would be unexpected.

Or if you are indeed pushing double your rate limit for several minutes and never see a denial: shhh!