The ChatCompletion response rate-limit headers do not reflect previous request or token usage



OpenAI returns rate-limit headers with every chat completion response. According to the doc here, you can use these headers to determine how many tokens are remaining for the day, or how many requests are remaining for the current minute.

The problem is that these counts keep resetting: the remaining-requests header always shows the maximum minus one, and the remaining-tokens header always shows the maximum minus the tokens used by the current request, regardless of earlier requests.

So for example:

Say I have a prompt that translates to 50 tokens. My daily token allowance is 500,000 and my request limit is 5,000 per minute.

When I send the prompt I’ll get the value 4999 for the header x-ratelimit-remaining-requests and 499950 for the header x-ratelimit-remaining-tokens. But when I send another request immediately after, I’ll get the same values back. Instead of 4998 requests left it will return 4999 again, and if the second prompt is also 50 tokens the token header will again read 499950 instead of 499900.

Clearly the OpenAI API does not keep a running count of how many requests you’ve actually sent, so these headers are absolutely useless.
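For reference, a minimal sketch of reading the counters described above. The header names follow the OpenAI docs; the `headers` dict is a stand-in for something like `response.headers` from the `requests` library, and the example values are the ones from the post.

```python
# Sketch: parse the rate-limit counters out of a chat-completions response.
# `headers` is a stand-in for a real HTTP response's header mapping.

def parse_rate_limit_headers(headers):
    """Extract the remaining-request and remaining-token counters."""
    return {
        "remaining_requests": int(headers["x-ratelimit-remaining-requests"]),
        "remaining_tokens": int(headers["x-ratelimit-remaining-tokens"]),
    }

# Values as described above (5,000 RPM limit, 500,000 daily tokens,
# a 50-token prompt just consumed):
headers = {
    "x-ratelimit-remaining-requests": "4999",
    "x-ratelimit-remaining-tokens": "499950",
}
print(parse_rate_limit_headers(headers))
# → {'remaining_requests': 4999, 'remaining_tokens': 499950}
```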


Don’t view the limit as discrete minute-long windows, and you’ll get a better understanding.

5000 RPM = 83.33 RPS = one request per 12.00ms

The memory of that request can expire within 12 ms, returning the counter to your original remaining amount. You can also see this in the “reset” header values.

Then if you have 5 million tokens per day (such as tier 3: gpt-4-1106-preview @ 5,000 RPM - 300,000 TPM - 5,000,000 TPD), consider:

5M/day = 57.87 TPS = one token per 17.28ms

So your usage “empties” at a particular rate until it ultimately resets with no memory of past requests, or it accrues faster than the drain rate until you face API denial.
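The behavior above can be sketched as a continuous “leaky bucket”: the remaining-request counter refills at a constant rate instead of resetting on minute boundaries. This is an illustrative model, not OpenAI’s actual implementation; the numbers are the 5,000 RPM example from this thread.

```python
# Sketch of a continuous-refill ("leaky bucket") rate-limit counter.
# At 5,000 RPM, one request's worth of budget refills every 12.00 ms.

RPM = 5000

def remaining_after(requests_sent, elapsed_ms, limit=RPM):
    """Remaining-request counter after `requests_sent` requests and
    `elapsed_ms` of continuous refill, capped at the full limit."""
    refilled = elapsed_ms * limit / 60_000  # requests that have "expired"
    used = max(requests_sent - refilled, 0.0)
    return limit - used

# Immediately after one request: 4999 remaining.
print(remaining_after(1, 0))   # → 4999.0
# 12 ms later that request has "expired" and the bucket is full again,
# which is why a second request still sees 4999 rather than 4998.
print(remaining_after(1, 12))  # → 5000.0
```

The same drain logic applies to the daily token budget, just with a slower rate (5,000,000 tokens / 86,400 s ≈ 57.87 tokens per second, i.e. one token every ~17.28 ms).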