My account shows 90,000 TPM, and I’m using gpt-4 for all of my requests. Since this model allows only 40,000 TPM, would a possible rate-limit mitigation strategy be to switch to a different gpt-4-level model, such as gpt-4-0314, to give my organization access to an additional 40,000 TPM? Or does the 40,000 TPM limit apply to all of the gpt-4 models combined?
I understand that the rate limits are per organization, but if my organization has 90,000 TPM for chat models, can I use 40,000 TPM on gpt-4 and then another 40,000 TPM on a different gpt-4 model (gpt-4-0314)? Does that make more sense?
Yes, that makes perfect sense, but the rate limit will still kick in at 40k TPM regardless of whether you switch from “gpt-4” to “gpt-4-0314” or any other variant. The whole point of rate limits is to keep the system performant for everyone. There is a lot of load balancing and careful allocation of new resources to new applications going on, and bypassing the limits would cause problems for everyone.
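Since switching snapshots doesn't buy extra headroom, the practical mitigation is to throttle on the client side. The sketch below is a minimal, hypothetical example, not part of any OpenAI library: a sliding-window tokens-per-minute budget that, following the answer above, treats `gpt-4` and `gpt-4-0314` as one shared 40,000 TPM pool. The `TokenBudget` class, the `FAMILY` mapping, and the injectable `clock` are assumptions for illustration, and the token count per request is whatever you estimate before sending it.

```python
import time
from collections import deque


class TokenBudget:
    """Sliding-window tokens-per-minute budget (a client-side sketch).

    Assumption: all gpt-4 snapshots draw from one shared pool, so usage is
    tracked per model *family* rather than per exact model name.
    """

    # Hypothetical mapping of model names that share one rate-limit pool.
    FAMILY = {"gpt-4": "gpt-4", "gpt-4-0314": "gpt-4"}

    def __init__(self, limit_tpm, window_s=60.0, clock=time.monotonic):
        self.limit_tpm = limit_tpm
        self.window_s = window_s
        self.clock = clock
        self.events = {}  # family -> deque of (timestamp, tokens)

    def _used(self, family, now):
        events = self.events.setdefault(family, deque())
        # Drop requests that have fallen out of the sliding window.
        while events and now - events[0][0] >= self.window_s:
            events.popleft()
        return sum(tokens for _, tokens in events)

    def try_consume(self, model, tokens):
        """Record the request and return True if it fits within the budget."""
        family = self.FAMILY.get(model, model)
        now = self.clock()
        if self._used(family, now) + tokens > self.limit_tpm:
            return False  # caller should wait and retry later
        self.events[family].append((now, tokens))
        return True


# Usage with a fake clock so the window can be advanced by hand.
t = [0.0]
budget = TokenBudget(limit_tpm=40_000, clock=lambda: t[0])
print(budget.try_consume("gpt-4", 40_000))       # True
print(budget.try_consume("gpt-4-0314", 1))       # False: same shared pool
t[0] = 60.0
print(budget.try_consume("gpt-4-0314", 40_000))  # True: window has rolled over
```

Keying the budget by family rather than by exact model name is what encodes the point made in the answer: a request to `gpt-4-0314` is refused once `gpt-4` has used the full 40,000 tokens in the current minute.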