Hi, I am sending 150 parallel requests to OpenAI using JMeter (well below my rate limits). Latency is about 3 s for the first 50 or so requests, but it gradually increases, reaching as high as 6 s towards the end. Here are the screenshots. Is there any limit that decides how many parallel requests a user can run at one time?
OpenAI publishes no information on the subject. We can only guess, but similar anecdotes have been reported.
Requests to the API pass through Cloudflare's firewall, which sometimes returns a 503 gateway error. The number of simultaneous connections allowed from a single IP address may be one limitation, although AI model traffic is low-bandwidth.
Another limitation may be the AI platform's architecture: repeated calls to models that support prompt (context window) caching are served most efficiently when they land on compute units on the same server, because OpenAI does not use a distributed database backend for the API to permanently store that cost-reducing cache. So you may be hitting compute limits created by prompt caching and other techniques that pin an organization's traffic to particular servers.
Your client software must also support that much concurrency without blocking - writing it in Go, for example.