Concurrent request restriction

We are tier 5, so we have pretty high RPM and TPM limits. However, our typical request takes 10 to 40 seconds to process.

We found that we can process only about eight concurrent requests before getting an error message or being throttled back to a slow pace. Therefore it is impossible to get anywhere close to the published RPM and TPM limits. In fact, we cannot even meet the daily requirement for a single client, and we have others standing in line.

We also cannot use batch processing, because we have to guarantee a turnaround of about 8 hours, and our deliverable requires multiple passes on each piece of data: we need to process 3 or 4 prompts of 10 to 40 seconds each before we get a finished product for a piece of data. If we relied on batch processing, this stage alone could take up to four days.

10,000 RPM sounds great, but with an eight-concurrent-request limit we can achieve only about 20 RPM. Dividing that by 4 passes puts us at about 4 to 5 finished deliverables per minute, and we need to deliver thousands per day.
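Spelled out, the arithmetic above looks roughly like this (a back-of-the-envelope sketch using the figures from this post, taking 25 seconds as the midpoint of the 10 to 40 second range):

```python
# Rough throughput under a hard concurrency cap, using the numbers above.
concurrency = 8                # usable concurrent requests observed
avg_latency_s = 25             # midpoint of the 10-40 s per-request range
passes_per_deliverable = 4     # prompts needed per finished piece of data

effective_rpm = concurrency / avg_latency_s * 60               # ~19 requests/minute
deliverables_per_min = effective_rpm / passes_per_deliverable  # ~4.8 per minute

print(f"~{effective_rpm:.0f} RPM, ~{deliverables_per_min:.1f} deliverables/min")
```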

Are there any workarounds for this, and what are my options? This restriction is a complete showstopper for our entire business model.

BTW, this is not a limitation of my machine. I have experimented with running 50 concurrent API requests: the first 40 to 60 process extremely fast, but after that it throttles down significantly, to the point where it's even slower than when I cap it at eight concurrent API requests…
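A minimal sketch of this kind of concurrency test, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and gpt-4o-mini as a placeholder model:

```python
import asyncio
import time

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def timed_call(i: int) -> float:
    """Send one small chat request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": f"Reply with 'ok' ({i})"}],
        max_tokens=5,
    )
    return time.perf_counter() - start


async def sweep(concurrency: int) -> None:
    """Launch `concurrency` requests at once and report per-request latencies."""
    latencies = await asyncio.gather(*(timed_call(i) for i in range(concurrency)))
    print(
        f"concurrency={concurrency:3d}  "
        f"min={min(latencies):.2f}s  max={max(latencies):.2f}s  "
        f"mean={sum(latencies) / len(latencies):.2f}s"
    )


async def main() -> None:
    for n in (8, 20, 50):  # the concurrency levels discussed above
        await sweep(n)


asyncio.run(main())
```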

I know how to code for asynchronous processing as well as true parallel processing, and my computer can run 20 parallel processes.

Reported before.

The metric I would look at is time to first token, to see whether requests are being held back by intermediary steps such as Cloudflare, workers that check your input limits, and other processes outside of AI generation that affect performance.
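A minimal way to measure time to first token is to stream the response and timestamp the first content chunk. A sketch, assuming the openai Python SDK (v1+) and gpt-4o-mini as a placeholder model:

```python
import time

from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Write one short sentence."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    # Skip role-only/empty chunks; record when the first generated text arrives.
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter()
total_elapsed = time.perf_counter() - start

print(f"time to first token: {first_token_at - start:.2f}s, total: {total_elapsed:.2f}s")
```

A long time to first token with a normal generation rate afterwards points at queuing or intermediary steps rather than the model itself.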

The solution to such a bottleneck, and to an organization's AI model slowdown that originates from a system routing you to the same inference server to exploit non-distributed context caching, may be to proxy some of the requests out of another data center and see whether a new IP route delivers better performance.

One can consider if Enterprise scale tier isn't a solution that first needed a problem.

Thanks. Please correct me if I'm wrong, but I think your response is addressing performance bottlenecks, whereas the main question has to do with concurrency restrictions. Even if there are some delays from various bottlenecks on the API calls, my tests show that I get the best performance at a maximum of around eight concurrent calls. You would think I would get better throughput by increasing from 8 to 50 concurrent calls, but this is not the case.


I'm saying that this API behavior has been observed before, with no response from OpenAI, nor documentation that explains the futility of prepaying and aspiring to a service level that is advertised and implied.

I'm also saying that this apparent throttling (a limitation on the allocation of API resources available for you to use) may come from those intermediary steps between you and OpenAI, or from a particular "server", based on IP or geographic origin.

Imagine, for example, a low-resource intermediary that has to encode tokens to measure each request against your rate limit.

Such a constriction point could be measured by comparing against a fast service such as embeddings, which also hits throughput limits prematurely.
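One way to probe this: fire the same number of concurrent requests at the embeddings endpoint and compare the timings against the chat models. A sketch, assuming the openai Python SDK (v1+) and text-embedding-3-small as a placeholder model:

```python
import asyncio
import time

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def timed_embedding(i: int) -> float:
    """Send one small embeddings request and return its latency in seconds."""
    start = time.perf_counter()
    await client.embeddings.create(
        model="text-embedding-3-small",  # placeholder embeddings model
        input=f"throughput probe {i}",
    )
    return time.perf_counter() - start


async def probe(concurrency: int) -> None:
    batch_start = time.perf_counter()
    latencies = await asyncio.gather(*(timed_embedding(i) for i in range(concurrency)))
    print(
        f"concurrency={concurrency:3d}  wall={time.perf_counter() - batch_start:.2f}s  "
        f"slowest request={max(latencies):.2f}s"
    )


asyncio.run(probe(50))
```

If embeddings also choke at roughly the same concurrency, the constriction is likely upstream of model generation.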

You would be able to see whether it is connection-based by using a parallel connection, or by using a second organization from your origin IP.

If it is organization-based, we've seen such corporate behavior before: the model generation rate was slowed for an undisclosed lower class of customers as an almost instantly-effective decision, accompanied by a flood of forum issue reports, only for us to find out weeks later that they had been working on a tier system that now exists in its current form of further denying service quality and rollout to those unwilling to prepay. Absolute silence from OpenAI except for temporary documentation: 'high tier may get lower latency servers'.

You can send a message to "help" and report your API observation, which is common to others at tier 5. Request an explanation, documentation, and a resolution, and hope for something beyond "restart your browser, best".


Thanks to everyone for the replies.

OpenAI, if you are listening… This is a critical issue being reported by many. In the spirit of helping businesses become more successful using your services, it would be very helpful to provide full transparency on this in your documentation.

I have the same issue. I'm tier 2, and EIGHT is also about the number of concurrent API calls I can process normally before things slow down significantly; sometimes it's 7, sometimes it's 9. My requests usually take less than a second to process, but when I batch together 8 or 9, the last one or two get stuck and can take 15-30 seconds, taking the whole process from under a second to an unbearable 15-30 seconds. My use case involves 12-13 concurrent requests per use, so it's been super annoying. I also noticed this behavior a few weeks ago, then it went away, and now it's back, at about the same limit of 8, in both gpt-4o and gpt-4o-mini. I'm using Python with asyncio for async task gathering.

Someone mentioned doing some exploration to find out where the bottleneck may be, but I don't know how to do that. How do I check whether it's Cloudflare or something else? How do I measure time to first token, and what would that tell me?

I don't have this problem using Llama on Groq, but they have very low actual rate limits for non-enterprise customers like me, so I can't rely on them. I could also route 5 calls to OpenAI, 5 to Groq, and 5 to Anthropic, Azure, or AWS, for example, but before going down complicated routes or solutions that won't scale if I do make it to hundreds of simultaneous users, I would appreciate any help. I won't spend $1,000 to get to Tier 5 just to discover that that was not the issue. Thanks a lot, y'all.

After implementing a 0.1-second spacing between calls this problem has decreased, and my requests are even faster now (half a second vs. 1 second). I also implemented asyncio.wait_for with a timeout limit of 3 seconds, which is still sometimes triggered, so I don't think I have solved this sustainably…
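For reference, a sketch of that pattern (spaced launches, a concurrency cap, and a per-request timeout), assuming the openai Python SDK (v1+) and gpt-4o-mini as a placeholder model; the cap of 8, the 0.1 s spacing, and the 3 s timeout are the values mentioned in this thread, not recommendations:

```python
import asyncio

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
semaphore = asyncio.Semaphore(8)  # cap in-flight requests near the observed sweet spot
LAUNCH_SPACING_S = 0.1  # the 0.1 s spacing between calls described above
TIMEOUT_S = 3.0  # the asyncio.wait_for timeout described above


async def guarded_call(prompt: str) -> str | None:
    """One request under the concurrency cap and timeout; returns None on timeout."""
    async with semaphore:
        try:
            response = await asyncio.wait_for(
                client.chat.completions.create(
                    model="gpt-4o-mini",  # placeholder model
                    messages=[{"role": "user", "content": prompt}],
                ),
                timeout=TIMEOUT_S,
            )
            return response.choices[0].message.content
        except asyncio.TimeoutError:
            return None  # caller decides whether to retry


async def main() -> None:
    tasks = []
    for i in range(13):  # e.g. the 12-13 calls per use mentioned above
        tasks.append(asyncio.create_task(guarded_call(f"prompt {i}")))
        await asyncio.sleep(LAUNCH_SPACING_S)  # stagger launches
    results = await asyncio.gather(*tasks)
    print(sum(r is None for r in results), "of", len(results), "timed out")


asyncio.run(main())
```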