We are Tier 5, so we have pretty high RPM and TPM limits. However, our typical requests take 10 to 40 seconds each to process.
We found that we can process only about eight concurrent requests before getting an error message or being throttled back to a slow pace. Therefore it is absolutely impossible to get even close to the published RPM and TPM limits. In fact, we cannot even meet the daily requirement for a single client, and we have others standing in line.
Also, we cannot use batch processing because we have to guarantee a turnaround of about 8 hours, and our deliverable requires multiple passes on each piece of data, meaning we need to process 3 or 4 prompts of 10 to 40 seconds each before we get a finished product for a piece of data. If we relied on batch processing, this process alone could take up to four days.
10,000 RPM sounds great, but with an eight-concurrent-request limit we can achieve only about 20 RPM. Dividing that by 4 passes puts us at about 4 to 5 finished deliverables per minute, and we need to deliver thousands per day.
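To make the arithmetic concrete, here is a rough back-of-envelope sketch (plain Python; the latency and pass counts are the assumptions from my own tests, not measured constants):

```python
# Back-of-envelope throughput: effective RPM is bounded by
# concurrency / average request latency, not by the published rate limit.
concurrency = 8             # concurrent requests before throttling kicks in
avg_latency_s = 25          # requests take 10-40 s; ~25 s average assumed
passes_per_deliverable = 4  # each piece of data needs 3-4 prompts

requests_per_minute = concurrency * 60 / avg_latency_s           # ~19 RPM
deliverables_per_minute = requests_per_minute / passes_per_deliverable

print(f"{requests_per_minute:.0f} RPM, {deliverables_per_minute:.1f} deliverables/min")
# -> roughly 19 RPM and ~5 deliverables per minute, nowhere near 10,000 RPM
```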
Are there any workarounds for this, and what are my options? This restriction is a complete showstopper for our entire business model.
BTW, this is not a limitation of my computer. I have experimented with running 50 concurrent API requests: the first 40 to 60 process extremely fast, but after 60 it throttles down significantly, to where it’s even slower than when I set it to eight concurrent API requests.
I know how to code for asynchronous processing as well as true parallel processing, and my computer can run 20 parallel processes.
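For reference, the kind of concurrency test I ran looks roughly like this. A minimal sketch using asyncio and the openai Python package; the model name and prompt are placeholders, and the real workload is obviously heavier:

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def one_request(sem: asyncio.Semaphore, i: int) -> float:
    """Send one chat completion and return its wall-clock latency."""
    async with sem:
        start = time.monotonic()
        await client.chat.completions.create(
            model="gpt-4o",  # placeholder; substitute the model you actually use
            messages=[{"role": "user", "content": f"Test request {i}"}],
        )
        return time.monotonic() - start

async def run_batch(n_requests: int, concurrency: int) -> None:
    """Run n_requests with at most `concurrency` in flight and report throughput."""
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests
    start = time.monotonic()
    latencies = await asyncio.gather(*(one_request(sem, i) for i in range(n_requests)))
    elapsed = time.monotonic() - start
    print(f"concurrency={concurrency}: {n_requests / elapsed * 60:.1f} RPM, "
          f"avg latency {sum(latencies) / len(latencies):.1f}s")

# Compare throughput at different concurrency caps, e.g. 8 vs 50.
asyncio.run(run_batch(100, 8))
```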
A metric I would look at is time to first token, to see whether requests are being held back by intermediary steps such as Cloudflare, workers that check your input against limits, and other processes outside of AI generation that have an impact.
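As a minimal sketch of what I mean (assuming the openai Python package and a placeholder model), streaming lets you separate time to first token from total generation time:

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.monotonic()
stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
first_token_at = None
for chunk in stream:
    # Record when the first piece of generated content arrives.
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.monotonic() - start
total = time.monotonic() - start
print(f"time to first token: {first_token_at:.2f}s, total: {total:.2f}s")
```

If time to first token is large relative to total generation time, the delay is happening before inference starts, which points at the intermediary steps rather than the model.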
The solution to that, and to an organization-wide model slowdown originating from a system that tries to route you to the same inference server to employ non-distributed context caching, may be to proxy some of the requests out through another data center and see whether a new IP route delivers better performance.
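A rough sketch of that experiment, assuming you have an egress proxy in another region (the proxy URL is a placeholder) and a recent httpx, since the openai client accepts a custom HTTP client:

```python
import httpx
from openai import OpenAI

# Route API traffic through a proxy in another region / IP range
# to see whether a different network path changes throughput.
proxied_client = OpenAI(
    http_client=httpx.Client(proxy="http://my-other-region-proxy.example:8080"),  # placeholder
)

direct_client = OpenAI()  # default route, for comparison

# Run the same timed batch of requests against both clients and compare RPM.
```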
One can also consider whether the Enterprise scale tier isn’t a solution that first needed a problem.
Thanks. Please correct me if I’m wrong, but I think your response addresses performance bottlenecks, whereas the main question has to do with concurrency restrictions. Even if I have some delays due to various bottlenecks on the API calls, my tests show that I get the best performance at a maximum of about eight concurrent calls. You would think I would get better throughput by increasing from 8 to 50 concurrent calls, but this is not the case.
I’m saying that this API behavior has been observed before, with no response from OpenAI, nor any documentation that explains the futility of prepaying and aspiring to a service level that is advertised and implied.
I’m also saying that this apparent throttling, a limitation on the allocation of API resources available for you to use, may come from those intermediary steps between you and OpenAI, or from a particular “server”, based on IP or geographic origin.
Imagine, for example, a low-resource intermediary that has to encode your input into tokens to measure each request against your rate limit.
Such a constriction point could be measured by comparing against a fast service such as embeddings, which also hits premature throughput limits.
You will be able to see whether it is connection-based by using a parallel connection, or a second organization from your origin IP.
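A minimal sketch of that comparison, assuming the openai Python package; the point is just to see whether a lightweight endpoint like embeddings flattens out at the same concurrency ceiling as chat completions:

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def one_embedding(sem: asyncio.Semaphore, i: int) -> None:
    """Send one small embedding request under the concurrency cap."""
    async with sem:
        await client.embeddings.create(
            model="text-embedding-3-small",  # fast, cheap endpoint for comparison
            input=f"probe {i}",
        )

async def embedding_throughput(n: int, concurrency: int) -> None:
    sem = asyncio.Semaphore(concurrency)
    start = time.monotonic()
    await asyncio.gather(*(one_embedding(sem, i) for i in range(n)))
    elapsed = time.monotonic() - start
    print(f"embeddings, concurrency={concurrency}: {n / elapsed * 60:.0f} RPM")

async def main() -> None:
    # If embeddings also flatten out around the same concurrency, the ceiling
    # is more likely connection- or organization-based than inference load.
    await embedding_throughput(200, 8)
    await embedding_throughput(200, 50)

asyncio.run(main())
```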
If it is organization-based: we’ve seen such corporate behavior before, with the model generation rate slowed for an undisclosed lower class of customers as an almost instantly-effective decision, followed by a flood of forum issue reports, only for us to find out weeks later that they had been working on a tier system that now exists in its current form of further denying service quality and rollout to those unwilling to prepay. Absolute silence from OpenAI except for temporary documentation along the lines of ‘higher tiers may get lower-latency servers’.
You can send a message to “help” and report your API observation, which is common to others at Tier 5. Request an explanation, documentation, and a resolution, and hope for something beyond “restart your browser, best”.