Batching prompts is not supported by the chat completions API. There are a number of posts in the community forums complaining about this.
But these two pages, which have been edited since Developer Day, still recommend batching prompts.
Good find: those documents have limited applicability, and they show a model that will be turned off in a matter of weeks. It's a solution answering no particular problem.
from openai import OpenAI

client = OpenAI()

num_stories = 10
prompts = ["Once upon a time,"] * num_stories

# batched example, with 10 story completions per request
response = client.completions.create(
    model="curie",  # legacy base model, already slated for shutdown
    prompt=prompts,
    max_tokens=20,
)
The information is useful if the endpoint and non-chat mode are made clear. It demonstrates completion: having the AI continue writing where the text is left open.
The example above is still poor, though, because the ten requests are independent and rely only on sampling variation to keep the stories from unfolding nearly identically. A confidently trained, low-perplexity model will produce even more uniform output. Batching is better demonstrated with a list of distinct inputs, as in the sketch below.
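A rough sketch of what I mean: the same batched call, but with different prompts per slot (the prompts here are made up, and gpt-3.5-turbo-instruct is just the assumed current completions model):

from openai import OpenAI

client = OpenAI()

# distinct prompts in one batched request, one completion per prompt
prompts = [
    "Once upon a time,",
    "The recipe calls for",
    "Dear hiring manager,",
]
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumed replacement for the retiring curie
    prompt=prompts,
    max_tokens=20,
)

# each choice carries an index that maps it back to its prompt
for choice in sorted(response.choices, key=lambda c: c.index):
    print(prompts[choice.index], choice.text)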
bump. Recommend that request batching capability be added to the chat completions API. It seems they want to steer the API towards chat completions over completions, and towards batching to ease high request volume, but they have not put the two together yet.
I implemented my own rate limiting in my code because I was occasionally hitting the RPM limit for GPT-3.5.
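Something like this minimal throttle (a sketch only; the class name and the rpm value are illustrative, not my exact code):

import time

class RateLimiter:
    """Client-side throttle that spaces out requests to stay under an RPM cap."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.last_call = 0.0

    def wait(self) -> None:
        # sleep just long enough to keep requests at least min_interval apart
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(rpm=3500)  # illustrative RPM tier
# call limiter.wait() before each client.chat.completions.create(...) request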
But my thought had been that for the GPT-4 calls, if I could batch them, I could minimize the latency of just waiting for data to move back and forth.
The app I’m working on has 50k-200k tokens of data per run split into 3000-5000 token chunks, and takes about 20 minutes just for the GPT-4 calls. If I could batch, I could probably shave off a minute or two. Not a huge amount, but still something.
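For now, the closest I can get is firing those calls concurrently instead of one at a time, so the total wall time approaches the slowest single request rather than the sum. A rough sketch (the chunk list and model name are placeholders, not my actual code):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def process_chunk(chunk: str) -> str:
    # one chat completion per chunk; chunks stand in for the 3000-5000 token pieces
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": chunk}],
    )
    return response.choices[0].message.content

async def process_all(chunks: list[str]) -> list[str]:
    # send the requests concurrently rather than sequentially
    return await asyncio.gather(*(process_chunk(c) for c in chunks))

# results = asyncio.run(process_all(chunks))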