Parallel API Requests - Very Long Response Times

Hi, I am wondering if anyone else has solved this issue.

I love OpenAI and will continue to use it, but I am experiencing a problem with running requests in parallel. I have two interchangeable functions for handling large sets of data in batch requests, and I am testing batches of 100 requests against Claude 3 and ChatGPT-4. For one-off requests there is a difference (I still prefer ChatGPT for those), but my main concern is requests in parallel.

100 requests in parallel:
Claude 3 Sonnet - processing time: 18.963s
ChatGPT-4 1106 - processing time: 5:50.650 (m:ss.mmm)
ChatGPT-3.5-turbo-1106 - processing time: 9.852s

I have tested with 0125 and others as well, but they still have the same degree of latency. I also tested directly with OpenAI and with Azure OpenAI.

Is there some type of parameter setting I missed in the docs? It doesn’t matter how I call the API in parallel (locally vs. many Lambda calls), I still see similar results (minor additional latency from Lambda, as expected).

I am well within RPM, TPM, and any other limits. Why wouldn’t these all return within a time frame similar to a single request if I’m inside the limits? Are requests processed one at a time per account or API key? Is there some other nerfing that happens? Documentation links, please.
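For reference, the parallel calls look roughly like this (a minimal sketch, not my exact code – the model name, prompt, and request count are placeholders):

```python
# Sketch of the parallel-call pattern described above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": f"Process item {i}"}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Fire 100 requests concurrently; ideally total wall time should be
    # close to the slowest single request, not the sum of all of them.
    results = await asyncio.gather(*(one_request(i) for i in range(100)))
    print(len(results), "responses")

asyncio.run(main())
```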

2 Likes

Are you talking about the running cores for the OpenAI model? Is this set at the account level?

I have run the tests in parallel across Lambda (each invocation runs in its own container, in parallel) and it has the same issue, so it is not impacted by memory, cores, etc.

From the benchmarks I provided, it also is not impacted by different models (GPT-3.5, Claude, etc.), so if you’re referring to the machines I am running locally or in the cloud, that is not the bottleneck in this case.

1 Like

My bad. Got my forums mixed up.

Nothing to see here.

I’m also having this issue - making a single API request returns very quickly, but when I make a large number of parallel requests the latency increases significantly (i.e. goes from a few seconds to a few minutes).

@jamesmalin

You can try using the Batch API, which is 50% cheaper:
Batch

Create large batches of API requests for asynchronous processing. The Batch API returns completions within 24 hours for a 50% discount.
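Roughly, the flow is: write one request per line to a JSONL file, upload it, then create the batch (a sketch based on the docs; the file name is a placeholder):

```python
# Sketch of submitting a batch with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

# 1. requests.jsonl contains one JSON object per request
#    (custom_id, method, url, body).
# 2. Upload the file with purpose="batch".
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# 3. Create the batch; results come back within the completion window.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```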

1 Like

Hi, thanks Tom. I did indeed switch to this when it was announced, which was only 5 days after this post (posted April 10, released April 15) – timely! The only caveat is waiting up to 24 hours. Either way it’s great, because I had batches that literally took days to process. It’s a good option if you have massive amounts of data to process.

1 Like

Hey @jamesmalin - Did you get a chance to look at Scale Tier for APIs? You can specify the service tier in your completions. Hope this helps. Cheers!
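Something like this (a sketch; the model and prompt are placeholders, and you should check the docs for which service_tier values your account supports):

```python
# Sketch of setting the service tier on a completion call.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    service_tier="auto",  # use scale-tier capacity when available
)
print(resp.service_tier)  # the tier that actually served the request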

1 Like

I have not, but I might need to look into this further! It looks like the minimum is $5,000 (over 30 days) when including both input and output, but there are times this would actually be feasible, depending on the use case.

For most of our workloads it’s okay to wait up to 24 hours and submit multiple batches. It certainly beats having a machine run for days at a time.

But I would need to crunch the numbers to compare. It’s also good to know about for use cases where we want results immediately so we can start working on the datasets. Thanks.


Update:
Case 1 - Large amounts of data processing and okay with waiting
I just checked, and for the use case of processing large amounts of data where you’re okay with waiting for the response – the Batch API wins.
GPT-4o batch: $2.50 / 1M input tokens; $7.50 / 1M output tokens
GPT-4o scale tier: $2.90 / 1M input tokens; $10.92 / 1M output tokens (running 24 hours per day)
Winner: batch processing

Case 2 - Large amounts of requests, need responses faster (e.g. realtime user interactions)
Winner: scale tier
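
Rough math behind the Case 1 call, using the prices above and a made-up monthly volume (scale tier is committed capacity, so the per-token comparison is only approximate):

```python
# Back-of-the-envelope comparison with a hypothetical workload of
# 100M input + 20M output tokens per month.
input_toks_m, output_toks_m = 100, 20  # millions of tokens (hypothetical)

batch_cost = input_toks_m * 2.50 + output_toks_m * 7.50
scale_cost = input_toks_m * 2.90 + output_toks_m * 10.92

print(f"Batch API:  ${batch_cost:,.2f}")   # $400.00
print(f"Scale tier: ${scale_cost:,.2f}")   # $508.40
```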

2 Likes