Hi, I am sending 150 parallel requests to OpenAI using JMeter (well below my rate limits). Latency is about 3 s for the first 50 or so requests, but it gradually increases, reaching as high as 6 s towards the end. Here are the screenshots. Is there any limit that decides how many parallel requests a user can run at one time?
OpenAI publishes no information on the subject. We can only guess, but similar anecdotes have been reported.
Requests to the API pass through Cloudflare's firewall, which sometimes returns a 503 gateway error. The number of simultaneous connections allowed from a single IP address may be one limitation, although AI model traffic is low-bandwidth.
Another limitation may be the AI platform's architecture: repeated calls to models that support prompt (context window) caching are served most efficiently when they land on compute units on the same server, because OpenAI does not use a distributed database backend for the API to permanently store that cost-reducing cache. So you may be hitting compute limits created by prompt caching and other techniques that pin an organization's traffic to particular servers.
Your client software must also support that much concurrency without blocking - writing it in Go, for example.