Concurrent request restriction

We are tier 5, so we have pretty high RPM and TPM limits. However, our typical request takes 10 to 40 seconds to process.

We found that we can process only about eight concurrent requests before getting an error message or being throttled back to a slow pace. Therefore it is impossible to get anywhere close to the published RPM and TPM limits. In fact, we cannot even meet the daily requirement for a single client, and we have others standing in line.

We also cannot use batch processing, because we have to guarantee a turnaround of about 8 hours, and our deliverable requires multiple passes on each piece of data: we need to process 3 or 4 prompts of 10 to 40 seconds each before we get a finished product for a piece of data. If we relied on batch processing, this stage alone could take up to four days.

10,000 RPM sounds great, but with an eight-concurrent-request limit we can achieve only about 20 RPM. Dividing that by 4 passes puts us at about 4 to 5 finished deliverables per minute, and we need to deliver thousands per day.
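Spelled out, the arithmetic above looks roughly like this (a back-of-the-envelope sketch using the figures from this post, taking 25 seconds as the midpoint of the 10 to 40 second range):

```python
# Rough throughput under a hard concurrency cap, using the numbers above.
concurrency = 8                # usable concurrent requests observed
avg_latency_s = 25             # midpoint of the 10-40 s per-request range
passes_per_deliverable = 4     # prompts needed per finished piece of data

effective_rpm = concurrency / avg_latency_s * 60               # ~19 requests/minute
deliverables_per_min = effective_rpm / passes_per_deliverable  # ~4.8 per minute

print(f"~{effective_rpm:.0f} RPM, ~{deliverables_per_min:.1f} deliverables/min")
```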

Are there any workarounds for this, and what are my options? This restriction is a complete showstopper for our entire business model.

BTW, this is not a limitation of my machine. I have experimented with running 50 concurrent API requests: the first 40 to 60 process extremely fast, but after that it throttles down significantly, to the point where it's even slower than when I cap it at eight concurrent API requests…
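A minimal sketch of this kind of concurrency test, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and gpt-4o-mini as a placeholder model:

```python
import asyncio
import time

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def timed_call(i: int) -> float:
    """Send one small chat request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": f"Reply with 'ok' ({i})"}],
        max_tokens=5,
    )
    return time.perf_counter() - start


async def sweep(concurrency: int) -> None:
    """Launch `concurrency` requests at once and report per-request latencies."""
    latencies = await asyncio.gather(*(timed_call(i) for i in range(concurrency)))
    print(
        f"concurrency={concurrency:3d}  "
        f"min={min(latencies):.2f}s  max={max(latencies):.2f}s  "
        f"mean={sum(latencies) / len(latencies):.2f}s"
    )


async def main() -> None:
    for n in (8, 20, 50):  # the concurrency levels discussed above
        await sweep(n)


asyncio.run(main())
```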

I know how to code for asynchronous processing as well as true parallel processing, and my computer can run 20 parallel processes.

Reported before.

The metric I would look at is time to first token, to see whether requests are being held back by intermediary steps such as Cloudflare, workers that check your input limits, and other processes outside of AI generation that affect performance.
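A minimal way to measure time to first token is to stream the response and timestamp the first content chunk. A sketch, assuming the openai Python SDK (v1+) and gpt-4o-mini as a placeholder model:

```python
import time

from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Write one short sentence."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    # Skip role-only/empty chunks; record when the first generated text arrives.
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter()
total_elapsed = time.perf_counter() - start

print(f"time to first token: {first_token_at - start:.2f}s, total: {total_elapsed:.2f}s")
```

A long time to first token with a normal generation rate afterwards points at queuing or intermediary steps rather than the model itself.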

The solution to such a bottleneck, and to an organization's AI model slowdown that originates from a system routing you to the same inference server to exploit non-distributed context caching, may be to proxy some of the requests out of another data center and see whether a new IP route delivers better performance.

One can consider if Enterprise scale tier isn't a solution that first needed a problem.

Thanks. Please correct me if I'm wrong, but I think your response is addressing performance bottlenecks, whereas the main question has to do with concurrency restrictions. Even if there are some delays from various bottlenecks on the API calls, my tests show that I get the best performance at a maximum of around eight concurrent calls. You would think I would get better throughput by increasing from 8 to 50 concurrent calls, but this is not the case.


I'm saying that this API behavior has been observed before, with no response from OpenAI, nor documentation that explains the futility of prepaying and aspiring to a service level that is advertised and implied.

I'm also saying that this apparent throttling (a limitation on the allocation of API resources available for you to use) may come from those intermediary steps between you and OpenAI, or from a particular "server", based on IP or geographic origin.

Imagine, for example, a low-resource intermediary that has to encode tokens to measure each request against your rate limit.

Such a constriction point could be measured by comparing against a fast service such as embeddings, which also hits throughput limits prematurely.
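One way to probe this: fire the same number of concurrent requests at the embeddings endpoint and compare the timings against the chat models. A sketch, assuming the openai Python SDK (v1+) and text-embedding-3-small as a placeholder model:

```python
import asyncio
import time

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def timed_embedding(i: int) -> float:
    """Send one small embeddings request and return its latency in seconds."""
    start = time.perf_counter()
    await client.embeddings.create(
        model="text-embedding-3-small",  # placeholder embeddings model
        input=f"throughput probe {i}",
    )
    return time.perf_counter() - start


async def probe(concurrency: int) -> None:
    batch_start = time.perf_counter()
    latencies = await asyncio.gather(*(timed_embedding(i) for i in range(concurrency)))
    print(
        f"concurrency={concurrency:3d}  wall={time.perf_counter() - batch_start:.2f}s  "
        f"slowest request={max(latencies):.2f}s"
    )


asyncio.run(probe(50))
```

If embeddings also choke at roughly the same concurrency, the constriction is likely upstream of model generation.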

You would be able to see whether it is connection-based by using a parallel connection, or by using a second organization from your origin IP.

If it is organization-based, we've seen such corporate behavior before: the model generation rate was slowed for an undisclosed lower class of customers as an almost instantly-effective decision, accompanied by a flood of forum issue reports, only for us to find out weeks later that they had been working on a tier system that now exists in its current form of further denying service quality and rollout to those unwilling to prepay. Absolute silence from OpenAI except for temporary documentation: 'high tier may get lower latency servers'.

You can send a message to "help" and report your API observation, which is common to others at tier 5. Request an explanation, documentation, and a resolution, and hope for something beyond "restart your browser, best".


Thanks to everyone for the replies.

OpenAI, if you are listening… This is a critical issue being reported by many. In the spirit of helping businesses become more successful using your services, it would be very helpful to provide full transparency on this in your documentation.

I have the same issue. I'm tier 2, and EIGHT is also about the number of concurrent API calls I can process normally before things slow down significantly; sometimes it's 7, sometimes it's 9. My requests usually take less than a second to process, but when I batch together 8 or 9, the last one or two get stuck and can take 15-30 seconds, taking the whole process from under a second to an unbearable 15-30 seconds. My use case involves 12-13 concurrent requests per use, so it's been super annoying. I also noticed this behavior a few weeks ago, then it went away, and now it's back, at about the same limit of 8, in both gpt-4o and gpt-4o-mini. I'm using Python with asyncio for async task gathering.

Someone mentioned doing some exploration to find out where the bottleneck may be, but I don't know how to do that. How do I check whether it's Cloudflare or something else? How do I measure time to first token, and what would that tell me?

I don't have this problem using Llama on Groq, but they have very low actual rate limits for non-enterprise customers like me, so I can't rely on them. I could also route 5 calls to OpenAI, 5 to Groq, and 5 to Anthropic, Azure, or AWS, for example, but before going down complicated routes or solutions that won't scale if I do make it to hundreds of simultaneous users, I would appreciate any help. I won't spend $1,000 to get to Tier 5 just to discover that that was not the issue. Thanks a lot, y'all.

After implementing a 0.1-second spacing between calls this problem has decreased, and my requests are even faster now (half a second vs. 1 second). I also implemented asyncio.wait_for with a timeout limit of 3 seconds, which is still sometimes triggered, so I don't think I have solved this sustainably…
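For reference, a sketch of that pattern (spaced launches, a concurrency cap, and a per-request timeout), assuming the openai Python SDK (v1+) and gpt-4o-mini as a placeholder model; the cap of 8, the 0.1 s spacing, and the 3 s timeout are the values mentioned in this thread, not recommendations:

```python
import asyncio

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
semaphore = asyncio.Semaphore(8)  # cap in-flight requests near the observed sweet spot
LAUNCH_SPACING_S = 0.1  # the 0.1 s spacing between calls described above
TIMEOUT_S = 3.0  # the asyncio.wait_for timeout described above


async def guarded_call(prompt: str) -> str | None:
    """One request under the concurrency cap and timeout; returns None on timeout."""
    async with semaphore:
        try:
            response = await asyncio.wait_for(
                client.chat.completions.create(
                    model="gpt-4o-mini",  # placeholder model
                    messages=[{"role": "user", "content": prompt}],
                ),
                timeout=TIMEOUT_S,
            )
            return response.choices[0].message.content
        except asyncio.TimeoutError:
            return None  # caller decides whether to retry


async def main() -> None:
    tasks = []
    for i in range(13):  # e.g. the 12-13 calls per use mentioned above
        tasks.append(asyncio.create_task(guarded_call(f"prompt {i}")))
        await asyncio.sleep(LAUNCH_SPACING_S)  # stagger launches
    results = await asyncio.gather(*tasks)
    print(sum(r is None for r in results), "of", len(results), "timed out")


asyncio.run(main())
```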