Batching with ChatCompletion Endpoint


Ever since OpenAI added gpt-3.5-turbo, the model behind ChatGPT, to the API's Chat Completions endpoint, users migrating from the Completions endpoint (drawn by the economical pricing) have been trying to replicate "batching" on the new endpoint.

In the scope of this tutorial, "batching" refers to combining multiple completion requests, regardless of their contexts, into a single API call.

Why use batching?

Instead of explaining this, I’ll quote from OpenAI docs:

The OpenAI API has separate limits for requests per minute and tokens per minute.

If you're hitting the limit on requests per minute, but have available capacity on tokens per minute, you can increase your throughput by batching multiple tasks into each request. This will allow you to process more tokens per minute, especially with our smaller models.

Sending in a batch of prompts works exactly the same as a normal API call, except you pass in a list of strings to the prompt parameter instead of a single string.
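As a sketch of what the docs describe, here is batching on the legacy Completions endpoint. The API call itself is only shown in a comment; `response` below is a hand-written stand-in with the same shape as a real response (an assumption, not actual model output), used to show that choices should be matched back to prompts by their index field, since they aren't guaranteed to arrive in prompt order.

```python
# Batching on the legacy Completions endpoint: pass a list of strings as `prompt`.
# With a real API key you would call something like:
#   response = openai.Completion.create(model="text-davinci-003", prompt=promptsArray, max_tokens=20)

promptsArray = ["Once upon a time,", "The capital of France is", "2 + 2 ="]

# Hand-written stand-in mimicking the response shape (assumed values, not real output):
response = {"choices": [
    {"index": 2, "text": " 4"},
    {"index": 0, "text": " there was a dragon."},
    {"index": 1, "text": " Paris."},
]}

# Choices may come back in any order, so place each one by its index.
results = [""] * len(promptsArray)
for choice in response["choices"]:
    results[choice["index"]] = choice["text"]
```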

What’s the catch?

The above technique works great with the Completions endpoint. However, it doesn't work with the Chat Completions endpoint, because that endpoint doesn't take an array of prompts; it takes an array of messages.

Here's what an array of messages looks like:

    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]

The Solution

We have to somehow pass multiple prompts to the chat completions endpoint inside the messages array.

For this purpose I chose a string array.

But we cannot pass a string array to messages, nor can we pass one to the content attribute of a message object.

So we stringify the array of prompts and pass it as the content of a message whose role is set to user.
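In code, that stringification step is just json.dumps (the variable names here mirror the full snippet later in the post):

```python
import json

promptsArray = ["Hello world, from", "How are you B"]

# The messages API only accepts a string for "content", so the list of
# prompts is serialized into one JSON string.
stringifiedPromptsArray = json.dumps(promptsArray)
message = {"role": "user", "content": stringifiedPromptsArray}

# content is now a single string the endpoint will accept, and the
# original list can be recovered with json.loads.
```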

Now that we have the prompts, we need to tell the model what to do with these.

This is done using system role message which tells the model to complete individual elements of the array and return them as an array.


The system message must be appended at the end of the messages array. I tried placing it at the beginning of the array of message objects, and the model didn't reply consistently.


Here's a basic Python script that sends a batch request to the chat completions endpoint and gets the completed array in the response.

import openai
import json

openai.api_key = "OPENAI_API_KEY"  # supply your API key however you choose

promptsArray = ["Hello world, from", "How are you B", "I am fine. W", "The fifth planet from the Sun is"]

stringifiedPromptsArray = json.dumps(promptsArray)

prompts = [{
    "role": "user",
    "content": stringifiedPromptsArray
}]

batchInstruction = {
    "role": "system",
    "content": "Complete every element of the array. Reply with an array of all completions."
}

# The system message goes at the end of the messages array (see note above).
prompts.append(batchInstruction)

print("ChatGPT: ")
stringifiedBatchCompletion = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                                          messages=prompts,
                                                          max_tokens=1000)
batchCompletion = json.loads(stringifiedBatchCompletion.choices[0].message.content)
print(batchCompletion)




  • The promptsArray contains all the prompts that will be processed in a batch, as individual elements of a string array.

  • The promptsArray is then converted to a string using json.dumps() and stored in stringifiedPromptsArray, which is used as the content of the user's message.

  • batchInstruction is a system message that directs the chat completion model to complete every prompt in stringifiedPromptsArray and return an array of completions.

  • The chat completion is obtained from the response and converted back into an array of strings using json.loads(). The individual completions can then be easily accessed from batchCompletion.
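For example, each prompt can be paired with its completion by position (the completion strings below are sample stand-ins, not real model output):

```python
promptsArray = ["Hello world, from", "The fifth planet from the Sun is"]
# Stand-in for the parsed model reply (assumed values, not real output):
batchCompletion = ["Hello world, from Earth", "The fifth planet from the Sun is Jupiter"]

# The model is instructed to keep completions in prompt order, so zip pairs them up.
pairs = dict(zip(promptsArray, batchCompletion))
print(pairs["Hello world, from"])
```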


Input:  ['Hello world, from', 'How are you B', 'I am fine. W', 'The fifth planet from the Sun is']
Output: ['Hello world, from Earth', 'How are you Bob', 'I am fine. What about you?', 'The fifth planet from the Sun is Jupiter']


  • max_tokens doesn't limit the tokens of each individual prompt; it limits the total number of completion tokens for the whole request.
  • The length of one completion can influence the others in the batch. If one completion runs longer than expected, the others may get truncated, and the returned array might not even be valid JSON.
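Because of that failure mode, it's worth validating the model's reply before using it. A minimal sketch (parse_batch_completion is a hypothetical helper, not part of any library):

```python
import json

def parse_batch_completion(raw, expected_count):
    """Return the list of completions, or None if the reply is not a
    valid JSON array of the expected length (e.g. it was truncated)."""
    try:
        completions = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(completions, list) or len(completions) != expected_count:
        return None
    return completions

print(parse_batch_completion('["a", "b"]', 2))  # a well-formed reply parses fine
print(parse_batch_completion('["a", "b"', 2))   # a truncated reply returns None
```

You can also check the response's finish_reason: a value of "length" means the output hit max_tokens and was cut off.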

You could also check out reliableGPT, a Python package that handles batch calls to OpenAI.