My account shows a limit of 90,000 TPM, and I'm using gpt-4 for all of my requests. Since this model allows only 40,000 TPM, would a viable rate-limit mitigation strategy be to switch to a different gpt-4-level model, such as gpt-4-0314, to give my organization access to an additional 40,000 TPM? Or does the 40,000 TPM limit apply to all of the gpt-4 models combined?
Thanks so much!
Hi and welcome to the Developer Forum!
The rate limits are per organisation, so swapping models would not help in this situation. You're not the first to have thought of it.
I understand that the rates are per organization, but if I have 90,000 TPM for chat models for my organization, can I use up 40,000 TPM on gpt-4 and then use up another 40,000 TPM on a different gpt-4 model (gpt-4-0314)? Does that make more sense?
The rate limits can't be pooled across models, because each class of model has its own TPM limit.
You can watch the remaining rate-limit counts in the response headers decrease after each request to see which specific models affect the counts of others.
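For anyone who wants to try the header check, here is a minimal sketch in Python. It assumes the API returns headers named `x-ratelimit-limit-tokens`, `x-ratelimit-remaining-tokens`, and the matching `-requests` variants; verify those names against your own responses. The function name `parse_rate_limits` is just illustrative, and it only parses a headers mapping you already have, so it works with whatever HTTP client you use.

```python
from typing import Dict, Mapping, Optional

# Assumed header names; check them against the headers you actually receive.
RATE_LIMIT_HEADERS = {
    "limit_tokens": "x-ratelimit-limit-tokens",
    "remaining_tokens": "x-ratelimit-remaining-tokens",
    "limit_requests": "x-ratelimit-limit-requests",
    "remaining_requests": "x-ratelimit-remaining-requests",
}


def parse_rate_limits(headers: Mapping[str, str]) -> Dict[str, Optional[int]]:
    """Extract rate-limit counters from response headers (hypothetical helper).

    Header names are matched case-insensitively. A counter that is missing
    or not a plain integer is reported as None.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    result: Dict[str, Optional[int]] = {}
    for key, header_name in RATE_LIMIT_HEADERS.items():
        raw = lowered.get(header_name)
        result[key] = int(raw) if raw is not None and raw.isdigit() else None
    return result
```

To see whether two models draw from the same pool, one approach is to call model A and record `remaining_tokens`, call model B, then call model A again right away. If the count for A dropped by roughly what B consumed as well, the two likely share a limit. The counters refill over time, so this is only a rough check.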
Yes, it makes perfect sense, but the rate limit will still kick in at 40k regardless of whether you swap the model from “gpt-4” to “gpt-4-0314” or any variation thereof. The whole point of rate limits is to ensure the system remains performant for everyone, so there is a lot of load balancing and careful allocation of new resources to new applications going on. Bypassing this would cause problems for everyone.
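Since the 40k limit applies whichever gpt-4 variant you call, one mitigation is to throttle on the client side so you never exceed it. Below is a rough sketch of a sliding-window token budget. The class name `TokenBudget` and its methods are hypothetical, and the server's limiter may not use the same windowing, so treat this as an approximation rather than a mirror of the real behaviour.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class TokenBudget:
    """Client-side approximation of a tokens-per-minute limit (illustrative only)."""

    def __init__(self, tokens_per_minute: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.used = 0

    def _expire(self, now: float) -> None:
        # Drop spends that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_spend(self, tokens: int) -> bool:
        """Record the spend and return True if it fits in the current window."""
        now = self.clock()
        self._expire(now)
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True
```

You would keep a single `TokenBudget(40_000)` for all gpt-4 variants together, call `try_spend(estimated_tokens)` before each request, and wait and retry when it returns False.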
Got it! And if I switch to a different model altogether, such as 3.5, would I be able to use those additional TPM, up to my total allotted 90K TPM?
I’ve not actually tried, but that would seem reasonable and in keeping with the documentation.