Has anyone gotten any more detail on how rate limiting for the Chat Completions API works?
Using gpt-3.5-turbo, I’m regularly hitting 429 errors even though I never exceed 30% of my 90k tokens-per-minute (TPM) limit, measured by request start times. The error messages include my current usage, which appears close to the limit, so it’s definitely org-level rate limiting, not global.
I’ve tried a bunch of strategies around scheduling, backoffs and retries but the closest I’ve managed to get to the 90k TPM limit is about 30k TPM before I start seeing cascading failures.
From what I can work out, one of three things might be happening:
Maybe the usage is calculated at the end of each request. That would make logical sense, since the response length isn’t known upfront, but it seems pretty janky when a lot of API calls take 20-40 seconds. When I fire off a bunch of calls over a minute, sometimes they all return at nearly the same time, causing a spike in quota usage, and I have to pause for about a minute before continuing. This is unpredictable, and it would mean the optimal throughput strategy is one where I severely limit the risk of hitting the rate limit, capping throughput at around 30k TPM. Anecdotally, this is my gut feeling: errors seem to correlate with a flood of responses close together, even though my request frequency is very consistent, as is my request/response size.
Could I be using higher TPM than I think I am? I am calculating from usage.total in the responses. gpt-3.5-turbo doesn’t have an explicit “TPM unit” conversion factor in the docs, so I’ve been assuming it’s just 1.
Am I wrong about where/when the usage quota is incremented? If the rate limiting is calculated based on max_tokens in the request, then that would explain a good chunk of the difference between the 90k limit and the 30k I’m achieving, but not all of it.
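One way I could tell these possibilities apart is to replay my own request log under each accounting rule and see which one actually gets near the limit. Below is a minimal sketch of that idea. The `Call` fields, the function names, and the three accounting rules are my own assumptions for illustration, not documented API behavior, and the 60-second trailing window is a guess at how the limit is measured.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One logged request (hypothetical log format)."""
    start: float            # seconds, when the request was sent
    end: float              # seconds, when the response arrived
    prompt_tokens: int
    completion_tokens: int
    max_tokens: int         # max_tokens passed in the request


def peak_tpm(events, window=60.0):
    """Largest token sum over any trailing `window`-second interval.

    `events` is a list of (timestamp, tokens) charges.
    """
    events = sorted(events)
    peak = total = 0
    left = 0
    for t, tokens in events:
        total += tokens
        # Drop charges that have fallen out of the window ending at t.
        while events[left][0] <= t - window:
            total -= events[left][1]
            left += 1
        peak = max(peak, total)
    return peak


def peaks_by_hypothesis(calls):
    """Peak rolling usage under three guessed accounting rules."""
    def actual(c):
        return c.prompt_tokens + c.completion_tokens

    return {
        # Actual tokens, charged when the request is sent.
        "actual_at_start": peak_tpm([(c.start, actual(c)) for c in calls]),
        # Actual tokens, charged when the response comes back.
        "actual_at_end": peak_tpm([(c.end, actual(c)) for c in calls]),
        # Prompt plus max_tokens, reserved when the request is sent.
        "reserved_at_start": peak_tpm(
            [(c.start, c.prompt_tokens + c.max_tokens) for c in calls]
        ),
    }
```

If I feed it evenly spaced requests whose responses all land within a few seconds of each other, the "actual_at_end" peak comes out well above the "actual_at_start" peak, which is the spike I suspect. If "reserved_at_start" is the one that lines up with when the 429s begin, then max_tokens would be the more likely culprit.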