Is the rate limit tied to the API, or can it be cumulative if I have two models deployed?

I have two models deployed in Azure OpenAI namely:

  • gpt-4 with a 20k TPM rate limit
  • gpt-4-32k with a 60k TPM rate limit

When my code hits the rate limit on gpt-4, it falls back to gpt-4-32k.

This way, am I effectively getting an 80k (20k + 60k) TPM limit?

In Azure OpenAI, the rate limit is enforced per deployment, not across the whole API or resource. Each deployment has its own separate TPM limit, so your gpt-4 and gpt-4-32k deployments are throttled independently. By falling back from one to the other, you can consume up to 80k TPM in aggregate, but no single request ever gets more than the limit of the deployment that serves it: once gpt-4's 20k TPM is exhausted, the remaining headroom all comes from gpt-4-32k's 60k TPM.
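
For illustration, here is a minimal sketch of that fallback pattern using the openai Python SDK (v1+). The environment variable names, API version, and deployment names are assumptions chosen to match the setup in the question; production code would typically also add backoff and retries rather than failing over on the first 429:

```python
import os

from openai import AzureOpenAI, RateLimitError

# Client for the Azure OpenAI resource; endpoint and key env var
# names are placeholders for your own configuration.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Deployment names as configured in the Azure portal (assumed here
# to match the model names from the question).
PRIMARY_DEPLOYMENT = "gpt-4"        # 20k TPM
FALLBACK_DEPLOYMENT = "gpt-4-32k"   # 60k TPM


def chat_with_fallback(messages):
    """Try the primary deployment first; on a 429, retry on the fallback.

    Each deployment is throttled independently, so the fallback call
    draws on gpt-4-32k's separate 60k TPM allowance.
    """
    try:
        return client.chat.completions.create(
            model=PRIMARY_DEPLOYMENT, messages=messages
        )
    except RateLimitError:
        # gpt-4's 20k TPM limit was hit; gpt-4-32k has its own,
        # so-far-untouched limit.
        return client.chat.completions.create(
            model=FALLBACK_DEPLOYMENT, messages=messages
        )


response = chat_with_fallback([{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)
```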