The rate-limit documentation adds the following note on TPM for various models:
For our older models, the **TPM** (tokens per minute) unit is different depending on the model version:
|TYPE|1 TPM EQUALS|
| --- | --- |
|davinci|1 token per minute|
|curie|25 tokens per minute|
|babbage|100 tokens per minute|
|ada|200 tokens per minute|
In practical terms, this means you can send approximately 200x more tokens per minute to an `ada` model versus a `davinci` model.
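If the documented multipliers held, converting a normalized TPM quota into model-specific tokens per minute would be a simple lookup. A minimal sketch based on the table above (the function and dict names are illustrative, not part of any official SDK):

```python
# Multipliers from the table above: how many model tokens one
# normalized TPM unit is worth for each legacy model family.
TPM_MULTIPLIER = {
    "davinci": 1,
    "curie": 25,
    "babbage": 100,
    "ada": 200,
}

def effective_tpm(normalized_tpm: int, model: str) -> int:
    """Convert a normalized TPM quota into model-specific tokens/minute."""
    return normalized_tpm * TPM_MULTIPLIER[model]

print(effective_tpm(250_000, "babbage"))  # → 25000000
```

This is also where the "200x" figure in the quoted docs comes from: `TPM_MULTIPLIER["ada"] / TPM_MULTIPLIER["davinci"] == 200`.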
Two questions, as this is confusing:
(1) My account page shows rate limits; specifically, it shows 3,000 (RPM) and 250,000 (TPM) for the Babbage models. Is the documentation out of date?
(2) Would my Babbage TPM be 250,000 x 100 = 25,000,000 per minute? If so, I assume the only way to saturate this would be by batching prompts into a single request (given that 2,048 x 3,000 ≈ 6.1 million doesn't even get close to 25 million)?
The documentation isn’t always exactly correct. I would assume whatever is shown on your account rate-limit page is accurate for your API account.
Yes, your `babbage` limit would effectively be 25M tokens per minute. If you are limited to 3,000 RPM at 2,048 tokens per request, then without batching you are correct that you would not come close to hitting the TPM limit. That said, remember that TPM counts tokens in and out, so depending on how verbose your responses are, you could effectively double your usage rate.
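The ceiling described above is easy to sanity-check with back-of-envelope arithmetic (the RPM and context-size values are taken from this thread, not from any API call):

```python
# Saturation point without batching, per the numbers in this thread.
RPM = 3_000        # requests per minute shown on the account page
MAX_CTX = 2_048    # max tokens per legacy completion request

tpm_ceiling = RPM * MAX_CTX
print(tpm_ceiling)      # → 6144000 tokens/minute, unbatched, input only

# Even counting output tokens at the same volume as input,
# usage stays far below a 25M TPM quota:
print(2 * tpm_ceiling)  # → 12288000
```

So if the 100x multiplier were real, batching multiple prompts into one request would indeed be the only way to approach the quota.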
Regarding #2, I agree that this is what one would expect based on the documentation. However, I just ran a bunch of tests and can confirm that I am running into a 250,000 TPM limit when submitting batched requests. So the multipliers do not appear to be in effect, unfortunately.
Can anyone from OpenAI confirm that this is intentional?
Brief update: after some more experimentation, the limit does seem to be slightly higher than 250K TPM, but nowhere near a 100x multiplier. It's more like ~2x; anything beyond that I can't sustain without hitting rate limits.
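Given that the real ceiling appears closer to ~2x the documented figure than 100x, one pragmatic workaround is to throttle client-side against the empirically observed limit rather than the documented one. A minimal sliding-window sketch (the 500K limit is an assumption based on the ~2x observation above; the class name and structure are illustrative):

```python
import time

class TokenBudget:
    """Client-side TPM throttle: refuse a send that would push
    token usage over the observed per-minute limit."""

    def __init__(self, tpm_limit: int, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock          # injectable for testing
        self.events = []            # (timestamp, tokens) pairs

    def try_spend(self, tokens: int) -> bool:
        """Return True and record the spend if it fits in the
        trailing 60-second window, else False (caller should wait)."""
        now = self.clock()
        # Drop events older than one minute.
        self.events = [(t, n) for t, n in self.events if now - t < 60]
        used = sum(n for _, n in self.events)
        if used + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        return True

# Assumed ~2x the documented 250K, per the experiments above.
budget = TokenBudget(tpm_limit=500_000)
```

Before each batched request, call `budget.try_spend(estimated_tokens)` (counting both prompt and expected completion tokens, since TPM counts both) and back off when it returns `False`.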