Parallel API Requests - Very Long Response Times

Hi, I am wondering if anyone else has solved this issue.

I love OpenAI and will continue to use it, but I am experiencing a problem with running requests in parallel. I have two interchangeable functions for handling large sets of data in batch requests, and I am testing batches of 100 requests against Claude 3 and ChatGPT-4. For one-off requests there is a difference (I still prefer ChatGPT for those), but my main concern is requests in parallel.

100 requests in parallel:
Claude 3 Sonnet - processing time: 18.963s
ChatGPT-4 1106 - processing time: 5:50.650 (m:ss.mmm)
ChatGPT-3.5-turbo-1106 - processing time: 9.852s

I have tested with 0125 and others as well, but they still have the same degree of latency. I also tested directly with OpenAI and with Azure OpenAI.

Is there some type of parameter setting I missed in the docs? It doesn’t matter how I call the API in parallel (locally vs. many Lambda calls), I still see similar results (minor additional latency from Lambda, as expected).

I am well within RPM, TPM, and any other limits. Why wouldn’t these all return within a time frame similar to a single request if I’m inside the limits? Are requests processed one at a time per account or API key? Is there some other nerfing that happens? Documentation links, please.
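For reference, the parallel calls look roughly like this (a minimal sketch, not my exact code – the model name, prompt, and request count are placeholders):

```python
# Sketch of the parallel-call pattern described above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": f"Process item {i}"}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Fire 100 requests concurrently; ideally total wall time should be
    # close to the slowest single request, not the sum of all of them.
    results = await asyncio.gather(*(one_request(i) for i in range(100)))
    print(len(results), "responses")

asyncio.run(main())
```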

2 Likes

Are you talking about the running cores for the OpenAI model? Is this set at the account level?

I have run the tests in parallel across Lambda (each invocation runs in its own container, in parallel) and it has the same issue, so it is not impacted by memory, cores, etc.

From the benchmarks I provided, it also is not impacted by different models (GPT-3.5, Claude, etc.), so if you’re referring to the machines I am running locally or in the cloud, that is not the bottleneck in this case.

1 Like

My bad. Got my forums mixed up.

Nothing to see here.

I’m also having this issue - making a single API request returns very quickly, but when I make a large number of parallel requests the latency increases significantly (i.e. goes from a few seconds to a few minutes).

@jamesmalin

You can try using the Batch API, which is 50% cheaper:
Batch

Create large batches of API requests for asynchronous processing. The Batch API returns completions within 24 hours for a 50% discount.
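Roughly, the flow is: write one request per line to a JSONL file, upload it, then create the batch (a sketch based on the docs; the file name is a placeholder):

```python
# Sketch of submitting a batch with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

# 1. requests.jsonl contains one JSON object per request
#    (custom_id, method, url, body).
# 2. Upload the file with purpose="batch".
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# 3. Create the batch; results come back within the completion window.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```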

1 Like

Hi, thanks Tom. I did indeed switch to this when it was announced, which was only 5 days after this post (posted April 10, released April 15) – timely! The only caveat is waiting up to 24 hours. Either way it’s great, because I had batches that literally took days to process. It’s a good option if you have massive amounts of data to process.

1 Like

Hey @jamesmalin - Did you get a chance to look at Scale Tier for APIs? You can specify the service tier in your completions. Hope this helps. Cheers!
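Something like this (a sketch; the model and prompt are placeholders, and you should check the docs for which service_tier values your account supports):

```python
# Sketch of setting the service tier on a completion call.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    service_tier="auto",  # use scale-tier capacity when available
)
print(resp.service_tier)  # the tier that actually served the request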

1 Like

I have not, but I might need to look into this further! It looks like the minimum is $5,000 (over 30 days) when including both input and output, but there are times this would actually be feasible, depending on the use case.

For most of our workloads it’s okay to wait up to 24 hours and submit multiple batches. It certainly beats having a machine run for days at a time.

But I would need to crunch the numbers to compare. It’s also good to know about for use cases where we want results immediately so we can start working on the datasets. Thanks.


Update:
Case 1 - Large amounts of data processing and okay with waiting
I just checked, and for the use case of processing large amounts of data where you’re okay with waiting for the response – the Batch API wins.
GPT-4o batch: $2.50 / 1M input tokens; $7.50 / 1M output tokens
GPT-4o scale tier: $2.90 / 1M input tokens; $10.92 / 1M output tokens (running 24 hours per day)
Winner: batch processing

Case 2 - Large amounts of requests, need responses faster (e.g. realtime user interactions)
Winner: scale tier
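
Rough math behind the Case 1 call, using the prices above and a made-up monthly volume (scale tier is committed capacity, so the per-token comparison is only approximate):

```python
# Back-of-the-envelope comparison with a hypothetical workload of
# 100M input + 20M output tokens per month.
input_toks_m, output_toks_m = 100, 20  # millions of tokens (hypothetical)

batch_cost = input_toks_m * 2.50 + output_toks_m * 7.50
scale_cost = input_toks_m * 2.90 + output_toks_m * 10.92

print(f"Batch API:  ${batch_cost:,.2f}")   # $400.00
print(f"Scale tier: ${scale_cost:,.2f}")   # $508.40
```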

2 Likes