The parameter n specifies how many chat completion choices to generate for each input message.
How does it work?
Does the prompt get sent n times? For long prompts, that could eat heavily into the token rate limit.
Or does it work in some optimized way (i.e., send the prompt once and generate n completions)?
You only send the prompt once; the only change to the API call is the value of n, and you are only charged for the extra tokens of the additional outputs.
While the technical details aren’t published, the most logical implementation is to load the input into the model’s context once, then capture, reset, and repeat the output generation from the point in the context where the response begins.
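
For instance, a minimal sketch (the model and prompt below are just placeholders) showing that n is a single field on one request, and that the reported usage then counts the prompt tokens only once:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give me a one-line fun fact."}],
    n=3,             # three completions of the same prompt, sent once
    max_tokens=30,
)

for choice in response.choices:
    print(choice.message.content)

# prompt_tokens appears once; completion_tokens covers all three choices
print(response.usage)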
Do you know if the n replies are generated in parallel? From my experiments it seems to be the case, since the response time for n=10 is only a bit higher than for n=1 (56 s vs 45 s in one experiment).
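
For reference, a rough sketch of that timing comparison (the prompt and max_tokens are arbitrary, and results will vary from run to run):

import time
from openai import OpenAI

client = OpenAI()

def timed_request(n):
    # time one request asking for n completions of the same prompt
    start = time.time()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a short story about the sea."}],
        n=n,
        max_tokens=500,
    )
    return time.time() - start

print(f"n=1:  {timed_request(1):.1f} s")
print(f"n=10: {timed_request(10):.1f} s")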
It seems that, as of now (Aug 16), when using the GPT-4 model with n=N, the context tokens count towards the tokens-per-minute limit N times, even though the context is not processed by the model N times on the backend.
Agreed. Today, for me, it seems that RateLimitError computes the token count as prompt length * n (I get this error when setting a large n, e.g. n=100). But if I submit a smaller value (say n=10) and look at the chat.completion object under “usage”, it reports prompt_tokens and total_tokens as expected (as described in the accepted answer).
Wait, so just to confirm: with a low n, the input tokens are only counted once against the input token limit, and not as n * tokenizer(input prompt string)?
The rate limit system is a bit impenetrable but I just did a test: n=80, with 1 token in and 1 token out for my max_tokens setting (completions so there is no chat overhead).
It counted 80 tokens against my rate shown in the headers. Not 81, not 160.
So if you want to spend a bit more on some larger requests, with different input and output sizes so you can tell which one is being counted, you can see how the input further affects the rate-limit count.
For billing of that n=80, the usage is input=1 output=80.
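
Roughly how that test can be reproduced (a sketch; the prompt is just a short placeholder, and the headers are the same ones read in the longer code below):

from openai import OpenAI

client = OpenAI()

# short prompt, single-token completions, n=80
api = client.completions.with_raw_response.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Hi",
    max_tokens=1,
    n=80,
)
completion = api.parse()

# billed usage: input=1, output=80
print(completion.usage)
# rate-limit accounting, read from the response headers
print(api.headers.get("x-ratelimit-limit-tokens"),
      api.headers.get("x-ratelimit-remaining-tokens"))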
Sorry, just to clarify: I'm referring to the max token limit per completion (i.e. ~16k for gpt-3.5-turbo). I'm curious whether n affects how a single request counts towards that limit, not some global tokens-per-minute limit.
Each completion is its own run, getting its own independent remaining context length space (or what you set max_tokens to). So the same as if n=1.
However, you may face timeouts or indeed rate-limit blocking if going for (4096 tokens x 100), since the estimated usage is based on your max_tokens.
The end impact will be the usage bill as described.
Some code making ten completions of up to 10 tokens each (max_tokens=10, n=10):
from openai import OpenAI

client = OpenAI()

params = {
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "You rock my dude\n\n",
    "max_tokens": 10, "n": 10,
}

# with_raw_response lets us read the rate-limit headers as well as the parsed body
api = client.completions.with_raw_response.create(**params)
c = dict(api.parse())

usage = dict(c.get('usage'))
usage_prompt = usage.get('prompt_tokens')
usage_completion = usage.get('completion_tokens')
remaining = int(api.headers.get('x-ratelimit-remaining-tokens'))

for i in c['choices']:
    print(i.text + "\n ===")  # print each completion
print(f"in: {usage_prompt}\nout: {usage_completion}\nremain: {remaining}")
the output:
Thank you so much! That means a lot to
===
Thanks for being awesome! It's people like you
===
Thank you! I appreciate the compliment! :)
===
You’re absolutely crushing it!
I’m constantly impressed
===
I am a virtual AI, so I don't
===
Aw, thank you so much! I really
===
I really appreciate all the hard work and dedication you
===
Glad you think so! I'm just a
===
I'm glad you think so, thanks!
===
I'm glad that I can brighten up your
===
in: 5
out: 98
remain: 249900
Interesting point. Any hypothesis as to why, if these are separate runs, we are charged for the prompt tokens only once rather than n times?
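
One way to check this empirically (a sketch; the model and prompt are arbitrary) is to compare the reported usage for the same prompt at n=1 and a larger n: prompt_tokens stays constant while completion_tokens scales with n.

from openai import OpenAI

client = OpenAI()

def usage_for(n):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Name one planet."}],
        n=n,
        max_tokens=10,
    )
    return response.usage

# prompt_tokens should be identical; completion_tokens grows with n
print(usage_for(1))
print(usage_for(5))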