Parallelise calls to the API - is it possible and how?

I have a task for which I need to call the completions API several thousand times and expect this to take several hours.

I don’t need help with optimising the prompt itself, but with understanding the recommended way to parallelise the API calls (if this is allowed) so that the task completes a little faster.

Also, could anyone confirm if there is a limit to the number of concurrent calls? Some information I’ve found on the internet suggests that only up to two concurrent requests are possible.

Any other solution that could help complete the task faster with respect to making the API requests would be appreciated. Thanks!


Hey, welcome to the forum.

There’s no way to send multiple calls at the same time, as far as I know.

There are also rate limits in place.

If you need to access more frequently, I would reach out to OpenAI chat support and ask…

Hope this helps!

1 Like

@alexpapageo : Did you find a solution for sending concurrent/parallel requests to the OpenAI LLM in Python?

The doc is always showing

create …

Just use acreate, which is asynchronous, so you can use asyncio to make parallel calls.

Be aware of your quota. Depending on your prompt length, you should use the ratelimit library.

And keep in mind that you have to divide the official quota by 2 in order not to hit errors.

Also implement retries in case of errors (1 retry fixes ~90% of the roughly 1-in-100 errors; 2 retries fix essentially 100%).
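The asyncio-plus-retry approach above can be sketched as follows. This is a minimal illustration, not the library's own API: `fake_completion` is a hypothetical stand-in for the real async call (e.g. `openai.ChatCompletion.acreate(...)`, or `client.chat.completions.create(...)` on an `AsyncOpenAI` client in openai>=1.0), and the semaphore size and retry counts are assumptions you would tune to your own rate limits.

```python
import asyncio

# Hypothetical stand-in for the real async API call; swap in
# openai.ChatCompletion.acreate(...) (or an AsyncOpenAI client
# call in openai>=1.0) in production.
async def fake_completion(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"answer to: {prompt}"

async def call_with_retry(prompt: str, sem: asyncio.Semaphore,
                          max_retries: int = 2) -> str:
    async with sem:  # cap in-flight requests to stay under the rate limit
        for attempt in range(max_retries + 1):
            try:
                return await fake_completion(prompt)
            except Exception:
                if attempt == max_retries:
                    raise
                await asyncio.sleep(2 ** attempt)  # exponential backoff

async def run_all(prompts, max_concurrency: int = 10):
    sem = asyncio.Semaphore(max_concurrency)
    # gather() preserves the order of the input prompts
    return await asyncio.gather(*(call_with_retry(p, sem) for p in prompts))

results = asyncio.run(run_all([f"prompt {i}" for i in range(5)]))
print(results)
```

The semaphore is what keeps you from blasting your full quota at once; halving the advertised limit, as suggested above, is done by choosing `max_concurrency` (and your request pacing) conservatively.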

1 Like

Hello, it turns out that OpenAI’s completion call can automatically handle multiple prompts in a single request, and they all run in parallel.

So no need to use Python’s native async libraries!

prompt_1 = "This is the prompt - 1 to run separately"
prompt_2 = "This is the prompt - 2 to run separately"
prompt_3 = "This is the prompt - 3 to run separately"

prompts = [prompt_1,  prompt_2, prompt_3]

response = openai.Completion.create(
    model="text-davinci-003",  # any completions-endpoint model works here
    prompt=prompts,
    max_tokens=64,
)

# Print the responses, matched back to their prompts via the index field
for choice in sorted(response.choices, key=lambda c: c.index):
    print(choice.text, end="\n---\n")

:warning: Warning: the response object may not return completions in the order of the prompts, so always remember to match responses back to prompts using the index field.

That’s it! It all runs in parallel!
:link: Documentation :point_right:

1 Like

That documentation:

Sending in a batch of prompts works exactly the same as a normal API call, except you pass in a list of strings to the prompt parameter instead of a single string.

Completion models take strings.

ChatCompletion model endpoints instead take a list of dictionaries.

A list of a list of dictionaries is not permitted. Trial:

    messages=[
        [
            {"role": "system", "content": "You are an assistant."},
            {"role": "user", "content": "What is the capitol of California?"},
        ],
        [
            {"role": "system", "content": "You are an assistant."},
            {"role": "user", "content": "What is the capitol of New York?"},
        ],
    ]

= fail:

"API Error: [{'role': 'system', 'content': 'You are an assistant.'}, {'role': 'user', 'content': 'What is the capitol of California?'}] is not of type 'object' - 'messages.0'"
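Since the chat endpoint accepts exactly one messages list per request, batching chat conversations means one request per conversation. A minimal sketch of the valid shape (the commented-out call and model name are illustrative, not from the thread):

```python
# Each conversation is a single flat list of message dicts; a "batch" for
# the chat endpoint is therefore one request per conversation.
conversations = [
    [{"role": "system", "content": "You are an assistant."},
     {"role": "user", "content": "What is the capitol of California?"}],
    [{"role": "system", "content": "You are an assistant."},
     {"role": "user", "content": "What is the capitol of New York?"}],
]

# for messages in conversations:
#     openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)

# Sanity check: every request body is a flat list of dicts, never nested lists.
all_flat = all(isinstance(m, dict) for conv in conversations for m in conv)
print(all_flat)
```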


Yeah, thanks for letting us know!
That will be so helpful!

I’m following an example from the official cookbook to get some ideas to implement parallelization.


1 Like

So multi-prompt batching with the ChatCompletion API is officially not supported? It would speed up some applications quite a bit.

I usually make 3 calls at the same time from AWS Lambda, which will invoke multiple calls for you. The only blocker is hitting your rate limits. So all you need to do is work around the synchronous-call limitation in the API by making multiple synchronous calls at once.
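Outside Lambda, one simple way to make "multiple synchronous calls at once" from plain Python is a thread pool. In this sketch, `sync_completion` is a placeholder for the real blocking call (e.g. `openai.ChatCompletion.create(...)`), and the worker count is an assumption to tune against your rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder for the real blocking API call, e.g.
# openai.ChatCompletion.create(model=..., messages=...)
def sync_completion(prompt: str) -> str:
    return f"response for {prompt}"

prompts = ["prompt A", "prompt B", "prompt C"]

# map() runs the calls on worker threads but yields results in input order
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(sync_completion, prompts))

print(results)
```

Threads work here because the time is spent waiting on the network, so the GIL is not a bottleneck.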


Edit: Mon 22nd Apr 2024

Hello, I recently built a much cleaner version of the parallel processor. It has a Streamlit front-end, and there is a demo video plus setup instructions in the README.

You can find it here: GitHub - tiny-rawr/bulk-gpt-task-parallel-processor: Bulk write thousands of profiles based on scraped data in under 2 minutes.

Hope that helps! :smiley:


I wrote a script that processes 1000 GPT model requests in 2 minutes, by adapting their parallel-processing cookbook example for the embeddings model to the chat completions model.

All of the code for it is here plus run instructions in the README:

I’m currently working on making it easier to run and reuse, but it works great. You’ll need to change the max_tokens_per_minute and max_requests_per_minute to match your current usage tier. I’m on tier 2.


Thanks a lot for your input @becca9941 . I looked at your repo and also opened an issue.

At the moment I can’t fully grasp what you are doing in your repo, but it doesn’t seem to have a real connection to the parallelization code provided in the examples file. Am I right?

As I write this, my code is doing several hundred API requests a minute. I hit some unknown rate limit yesterday that isn’t mentioned anywhere in the docs: I was definitely doing between 500-1000 requests per minute, but after a while the system started telling me I was hitting rate limits despite being nowhere close to the tokens- or requests-per-minute boundary. Officially there is no requests-per-day limit mentioned for gpt3.5-0613, but I think internally there is one. A lot of the requests end in a bad gateway error.