We are observing some irregularities in API response latency, despite being well within our Tier 5 rate limits. We’d love to better understand how the concurrency limits actually work, since they have a very tangible impact on our users’ experience.
The irregularity is this: when we send 138 API requests in parallel to gpt-4o-mini (roughly 100 tokens per request, ~15,000 tokens total), the batch takes about 40 seconds to complete. That request volume is far below the Tier 5 limit of 30,000 RPM. When we send only 5 requests in parallel (638 tokens total), the batch completes in 1.7 seconds. We can see no reason the larger batch should take so much longer.
I conducted a more thorough investigation. We have various translation tasks we’re interested in, and we decompose the translation of a long document into translating its paragraphs in parallel.
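For concreteness, our fan-out pattern looks roughly like the sketch below. `translate_paragraph` is a stand-in for a single chat-completion call to gpt-4o-mini (we use an async sleep here as a placeholder for the real network call, so the structure is reproducible without an API key); the paragraph count mirrors our 138-request test.

```python
import asyncio
import time

# Hypothetical stand-in for one translation request. In our real test this is
# a single ~100-token chat-completion call to gpt-4o-mini.
async def translate_paragraph(paragraph: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for network + model latency
    return f"translated: {paragraph}"

async def translate_document(paragraphs: list[str]) -> tuple[list[str], float]:
    """Fan all paragraph translations out in parallel and time the batch."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(translate_paragraph(p) for p in paragraphs)
    )
    return results, time.perf_counter() - start

paragraphs = [f"paragraph {i}" for i in range(138)]  # mirrors our 138-request test
results, elapsed = asyncio.run(translate_document(paragraphs))
print(len(results), round(elapsed, 2))
```

With the real API, this is the shape of the code that produces ~40 s at 138 requests but 1.7 s at 5 requests.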
The most obvious conclusion is that there is some kind of invisible additional rate limiting happening behind the scenes, beyond what is outlined in the guide.
Per the guide, the caps are 150M tokens/minute and 30,000 requests/minute. Our tests don’t come anywhere close to either, yet there is a dramatic difference in total processing time, presumably because of some unpredictable concurrency throttling.
As we are about to adopt this system in production at wide scale, with many customers and potentially thousands of requests happening simultaneously, it’s crucial for us to have clarity on exactly how the rate limiting works.
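Whatever server-side throttling exists, in production we would want to bound our own in-flight request count. A minimal sketch of what we have in mind, where `MAX_IN_FLIGHT` is exactly the number we cannot currently choose sensibly without knowing the real concurrency limits (the simulated call is a placeholder):

```python
import asyncio

MAX_IN_FLIGHT = 20  # placeholder: the safe value is what we're asking about

async def send_request(i: int, sem: asyncio.Semaphore) -> int:
    # Acquire a concurrency slot before issuing the (here simulated) API call.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the real request
        return i

async def run_all(n: int) -> list[int]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    # gather() preserves input order, so results line up with request indices.
    return await asyncio.gather(*(send_request(i, sem) for i in range(n)))

completed = asyncio.run(run_all(138))
print(len(completed))
```

Without a documented concurrency ceiling, tuning `MAX_IN_FLIGHT` is guesswork, which is the core of our problem.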
When I asked support for specific numbers on the concurrency limits, so that we can predict their impact and plan around it (for example by moving to another solution such as AWS), or to be connected with a sales representative to develop a custom enterprise solution, I was in effect told our organization needs to spend $10,000 USD/month to qualify to speak with a sales agent.
Is there any way forward here for us? We’d love to keep developing on this platform, but it’s hard to navigate while blind to the concurrency limits.