Batching with ChatCompletion Endpoint

Intro

Ever since OpenAI introduced gpt-3.5-turbo, aka ChatGPT, to the OpenAI API on the Chat Completions endpoint, users migrating from the completions endpoint (owing to the economical pricing) have been trying to replicate “batching” on the new endpoint.

In the scope of this tutorial, we refer to combining multiple completion requests, irrespective of their contexts, into a single API call as batching.

Why use batching?

Instead of explaining this, I’ll quote from OpenAI docs:

The OpenAI API has separate limits for requests per minute and tokens per minute.

If you’re hitting the limit on requests per minute, but have available capacity on tokens per minute, you can increase your throughput by batching multiple tasks into each request. This will allow you to process more tokens per minute, especially with our smaller models.

Sending in a batch of prompts works exactly the same as a normal API call, except you pass in a list of strings to the prompt parameter instead of a single string.
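
For context, here is a minimal sketch of what that looks like on the completions endpoint with the pre-1.0 openai Python SDK (the model name is just an example, not something prescribed by the docs):

import openai

openai.api_key = "OPENAI_API_KEY"  # supply your API key however you choose

# The completions endpoint accepts a list of strings as the prompt
prompts = ["Once upon a time,", "The capital of France is", "2 + 2 ="]

response = openai.Completion.create(
    model="text-davinci-003",  # example model name
    prompt=prompts,
    max_tokens=20,
)

# Each choice carries an index that maps it back to the corresponding prompt
for choice in response.choices:
    print(prompts[choice.index], "->", choice.text.strip())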

What’s the catch?

The above technique works great with the completions endpoint. However, it doesn’t work with the chat completions endpoint, because the chat completions endpoint doesn’t take an array of prompts; it takes an array of messages.

Here’s what an array of messages looks like:

messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]

The Solution

We have to somehow pass multiple prompts to the chat completions endpoint in the message object.

For this purpose I chose a string array.

But we cannot pass a string array to the messages parameter, nor can we pass it to the content attribute of the message object.

So we stringify the array of strings containing the prompts and pass it to the message object with role set to user.

Now that we have the prompts, we need to tell the model what to do with these.

This is done using a system-role message which tells the model to complete the individual elements of the array and return them as an array.

Note:

The system message must be appended at the end of the message array. I tried placing it at the beginning of the array of message objects, and the model didn’t reply consistently.

Code:

Here’s a basic Python script to send batch requests to the chat completion endpoint and get the completed array in the response.

import openai
import json

openai.api_key = "OPENAI_API_KEY"  # supply your API key however you choose

# All the prompts to be completed in a single request
promptsArray = ["Hello world, from", "How are you B", "I am fine. W", "The  fifth planet from the Sun is "]

# Stringify the array so it can be passed as the content of a single user message
stringifiedPromptsArray = json.dumps(promptsArray)

print(promptsArray)

prompts = [
    {
        "role": "user",
        "content": stringifiedPromptsArray
    }
]

# System message appended at the end, instructing the model how to handle the array
batchInstruction = {
    "role": "system",
    "content": "Complete every element of the array. Reply with an array of all completions."
}

prompts.append(batchInstruction)

print("ChatGPT: ")
stringifiedBatchCompletion = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                                          messages=prompts,
                                                          max_tokens=1000)

# Parse the model's reply back into a Python list of completions
batchCompletion = json.loads(stringifiedBatchCompletion.choices[0].message.content)
print(batchCompletion)


Explanation

  • The promptsArray contains all the prompts that will be processed in a batch, as individual elements of a string array.

  • The promptsArray is then converted to a string using json.dumps() and stored in stringifiedPromptsArray, which will be used as the content of the user’s message.

  • batchInstruction is a system message that directs the chat completion model to complete every prompt in stringifiedPromptsArray and return an array of completions.

  • The chat completion is obtained from the response and converted back into an array of strings using json.loads(). The individual completions can then be easily accessed from batchCompletion, as shown below.
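
For example, assuming the model preserved the order and count of the prompts (reusing the variables from the code above):

for prompt, completion in zip(promptsArray, batchCompletion):
    print(f"{prompt!r} -> {completion!r}")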

Output:

['Hello world, from', 'How are you B', 'I am fine. W', 'The  fifth planet from the Sun is ']
ChatGPT: 
['Hello world, from Earth', 'How are you Bob', 'I am fine. What about you?', 'The fifth planet from the Sun is Jupiter']

Limitations

  • max_tokens doesn’t control the maximum tokens for individual prompts; instead it limits the total number of tokens per request.
  • The length of one completion can influence the other completions in the batch. If one completion runs longer than expected, the remaining completions may get truncated, and the returned array may not even be valid JSON (a defensive-parsing sketch follows this list).
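
Because of that last point, it’s safer to wrap the parsing step. Here is a minimal defensive-parsing sketch reusing the variables from the code above; the helper name and fallback behaviour are just illustrations, not part of the original approach:

import json

def parse_batch_completion(raw_content, expected_length):
    # Returns the list of completions, or None if the reply is unusable
    try:
        completions = json.loads(raw_content)
    except json.JSONDecodeError:
        return None  # the model replied with something that isn't valid JSON
    if not isinstance(completions, list) or len(completions) != expected_length:
        return None  # wrong shape: re-prompt or fall back to one call per prompt
    return completions

batchCompletion = parse_batch_completion(
    stringifiedBatchCompletion.choices[0].message.content, len(promptsArray)
)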

You could also check out reliableGPT for this - a Python package to handle batch calls to OpenAI.


I’ve been testing with it and it’s fairly unreliable. The issue is that the batch instruction goes into the model and the model does not always answer with all completions in the array. Very often (so far about 50% of the time) it responds with only one completion. This means you need to re-prompt, which costs you another API call plus tokens. While it’s a workaround, there’s no way this can be leveraged in production of anything.
I think either fix batching with ChatCompletion or make increasing the API rate limits easier.


You’ve been testing with a model made unreliable by OpenAI. One now can’t be certain of even going back to -0301, because it was also hit with degradation, but you can try.

"Complete every element of the array. Reply with an array of all completions."

Yeah, that’s going to be trouble now for anything with multiple outputs that isn’t a chat response to a user instruction, or anything requiring more than a trickle of tokens as a total response. The AI will find a way to crush the answer down to 500 tokens.

Sad really.

This method doesn’t really save much, unless you have a massive instruction that can be saved by not reiterating it each time.

For “batching”, you can follow the parallel call cookbook example from OpenAI, and put another holdoff in there for when you get rate-limited on an input.
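
A rough sketch of that idea with the pre-1.0 openai SDK (the thread count and backoff numbers are arbitrary choices, and the actual cookbook script is more sophisticated):

import time
import concurrent.futures
import openai

openai.api_key = "OPENAI_API_KEY"

def complete_one(prompt):
    # Retry with a simple exponential holdoff whenever we get rate limited
    delay = 1
    while True:
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
            )
            return response.choices[0].message.content
        except openai.error.RateLimitError:
            time.sleep(delay)
            delay = min(delay * 2, 60)

prompts = ["Hello world, from", "The fifth planet from the Sun is"]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(complete_one, prompts))
print(results)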

Interesting. Can you share what you are using batching for?

Your example above. Same way, same way of prompting, just different input.

Can you share the specifics of what went wrong? I just tested it on 10 separate strings and it worked.

Please note that this is just a hack/workaround to get batching on the API and not meant for production.

Hey, this thread is related to my issue but doesn’t solve my problem. Please take a look at my latest post; I would really appreciate your opinion on it.

Hi @alessandroamenta1

This is meant to replicate completions batching on the chat completion endpoint.

I have been dealing with this entire issue for the last week. The only success I had was requesting the system to return the response as JSON. I am basically asking it to generate content in multiple languages using product data I share. It’s OK now if I do this for a single product at a time, with occasional misses and JSON corruption, but the system went haywire when I was trying to batch several products’ data points into a single prompt with the expectation that the system would return something like the output below:
{
  "Sku77": {
    "en": {
      "title": "",
      "description": "",
      "keywords": ""
    },
    "de": {
      "title": "",
      "description": "",
      "keywords": ""
    }
  },
  "Sku88": {
    "en": {
      "title": "",
      "description": "",
      "keywords": ""
    },
    "de": {
      "title": "",
      "description": "",
      "keywords": ""
    }
  }
}

I tried other options as well, like telling OpenAI that I am sending CSV data with each row as a separate instruction. In that scenario it almost acted like the system was drunk: it would respond with incomplete responses, sometimes missing entire instructions. I realized I’m not saving much by doing this, because the prompts cost almost half as much as the responses, and the responses were still unreliable.