Hi, I am wondering if anyone else has solved this issue.
I love OpenAI and will continue to use it, but I am experiencing a problem with running requests in parallel. I have two interchangeable functions for handling large sets of data in batch requests, and I am testing batches of 100 requests with Claude 3 and GPT-4. With one-off requests there is a difference (I still prefer ChatGPT for these), but I am mainly concerned about requests in parallel.
I have tested with 0125 and other model versions as well, but they show the same degree of latency. I also tested both directly with OpenAI and with Azure OpenAI.
Is there some type of parameter setting I missed? I never saw anything about this in the docs. It doesn’t matter how I call the API in parallel (locally vs. many Lambda invocations), I still see similar results (minor additional latency from adding Lambda, as expected).
I am well within RPM, TPM, and any other limits. Why wouldn’t all of these return within a similar time frame to a single request if I’m inside the limits? Are requests processed one at a time per account or API key? Is there some other throttling that happens? Documentation links, please.
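For reference, the parallel calls look roughly like this. This is a minimal asyncio sketch using the official openai Python SDK; the model name, prompt, and request count are placeholders, not the exact code from my tests.

```python
# Minimal sketch of parallel chat completions (openai SDK >= 1.x assumed).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": f"Summarize record {i}"}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Fire ~100 requests concurrently, like the batch tests described above.
    results = await asyncio.gather(*(one_request(i) for i in range(100)))
    print(len(results), "responses received")

asyncio.run(main())
```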
Are you talking about the running cores for the OpenAI model? Is this set at the account level?
I have run the tests in parallel across Lambda (each invocation runs in a separate container, in parallel) and it shows the same issue, which is not affected by memory, cores, etc.
From the benchmarks I provided, it is also not affected by the choice of model (GPT-3.5, Claude, etc.), so if you’re referring to the machines I’m running locally or in the cloud, that is not the bottleneck in this case.
I’m also having this issue: a single API request returns very quickly, but when I make a large number of parallel requests the latency increases significantly (i.e., from a few seconds to a few minutes).
Hi, thanks Tom. I did indeed switch to the Batch API when it was announced, which was only 5 days after this post (posted April 10, released April 15), so that was timely! The only caveat is waiting up to 24 hours. Either way it’s great, because I had batches that literally took days to process. It’s a good option if you have massive amounts of data to process.
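For anyone else landing here, the Batch API flow is roughly the following. This is a sketch based on my understanding of the documented flow, using the official openai Python SDK; the file name is an example.

```python
# Rough sketch of submitting a batch job via the Batch API.
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one JSON object per line, each with custom_id, method, url, body.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results come back within 24 hours
)

print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until complete
```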
I have not, but I might need to look into this further! It looks like the minimum commitment is $5,000 over 30 days when including both input and output, but there are times this would actually be feasible depending on the use case.
For most of our workloads it’s okay to wait the 24 hours and submit multiple batches. It certainly beats having a machine run for days at a time.
But I would need to crunch the numbers to compare. It’s also good to know about for use cases where we want results immediately so we can start working on the datasets. Thanks.
Update:
Case 1 - Large amounts of data processing and okay with waiting
I just checked, and for the use case of processing large amounts of data where you’re okay with waiting for the response, the Batch API wins (see the rough cost sketch after Case 2).
gpt-4o batch: $2.50 / 1M input tokens; $7.50 / 1M output tokens
gpt-4o scale tier: $2.90 / 1M input tokens; $10.92 / 1M output tokens (running 24 hours per day)
Winner: batch processing
Case 2 - Large number of requests, need responses faster (e.g., realtime user interactions)
Winner: scale tier
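For Case 1, the back-of-envelope math looks like this. The 60M input / 20M output token volumes are made-up example numbers, and this ignores the scale-tier minimum commitment mentioned above; only the per-1M-token rates come from the comparison.

```python
# Rough cost comparison of batch vs. scale tier using the per-1M-token rates above.
def cost(input_m: float, output_m: float, in_rate: float, out_rate: float) -> float:
    # input_m / output_m are token volumes in millions
    return input_m * in_rate + output_m * out_rate

input_m, output_m = 60, 20  # hypothetical monthly volume, in millions of tokens

batch_cost = cost(input_m, output_m, 2.50, 7.50)
scale_cost = cost(input_m, output_m, 2.90, 10.92)

print(f"batch:      ${batch_cost:,.2f}")   # $300.00
print(f"scale tier: ${scale_cost:,.2f}")   # $392.40
```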