The parameter n specifies how many chat completion choices to generate for each input message.
How does it work?
Does the prompt get sent n times? For long prompts, that could eat heavily into the token rate limit.
Or does it work in some optimized way (i.e., send the prompt once and generate n completions)?
You only send the prompt once; the only change to the API call is the value of n, and you are only charged for the extra tokens of the additional outputs.
While the technical details aren’t published, the most logical implementation is to load the input into the model’s context once, then capture, reset, and repeat the output generation from the point in the context where the response begins.
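
For instance, a minimal sketch (the model and prompt below are just placeholders) showing that n is a single field on one request, and that the reported usage then counts the prompt tokens only once:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give me a one-line fun fact."}],
    n=3,             # three completions of the same prompt, sent once
    max_tokens=30,
)

for choice in response.choices:
    print(choice.message.content)

# prompt_tokens appears once; completion_tokens covers all three choices
print(response.usage)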
Do you know if the n replies are generated in parallel? From my experiments it seems to be the case, since the response time for n=10 is only a bit higher than for n=1 (56 s vs 45 s in one experiment).
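
For reference, a rough sketch of that timing comparison (the prompt and max_tokens are arbitrary, and results will vary from run to run):

import time
from openai import OpenAI

client = OpenAI()

def timed_request(n):
    # time one request asking for n completions of the same prompt
    start = time.time()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a short story about the sea."}],
        n=n,
        max_tokens=500,
    )
    return time.time() - start

print(f"n=1:  {timed_request(1):.1f} s")
print(f"n=10: {timed_request(10):.1f} s")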
It seems that, as of now (Aug 16), when using the GPT-4 model with n=N, the context tokens count towards the tokens-per-minute limit N times, even though the context is not processed by the model N times on the backend.
Agreed. Today, for me, it seems that RateLimitError computes the token count as prompt length * n (I get this error when setting a large n, e.g. n=100). But if I submit a smaller value (say n=10) and look at the chat.completion object under “usage”, it reports prompt_tokens and total_tokens as expected (as described in the accepted answer).
Wait, so just to confirm: with a low n, the input tokens are only counted once against the input token limit, and not as n * tokenizer(input prompt string)?
The rate limit system is a bit impenetrable but I just did a test: n=80, with 1 token in and 1 token out for my max_tokens setting (completions so there is no chat overhead).
It counted 80 tokens against my rate shown in the headers. Not 81, not 160.
So if you want to spend a bit more on some larger requests, with different input and output sizes so you can tell which one is being counted, you can see how the input further affects the rate-limit count.
For billing of that n=80, the usage is input=1 output=80.
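
Roughly how that test can be reproduced (a sketch; the prompt is just a short placeholder, and the headers are the same ones read in the longer code below):

from openai import OpenAI

client = OpenAI()

# short prompt, single-token completions, n=80
api = client.completions.with_raw_response.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Hi",
    max_tokens=1,
    n=80,
)
completion = api.parse()

# billed usage: input=1, output=80
print(completion.usage)
# rate-limit accounting, read from the response headers
print(api.headers.get("x-ratelimit-limit-tokens"),
      api.headers.get("x-ratelimit-remaining-tokens"))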
Sorry, just to clarify: I'm referring to the max token limit per completion (i.e. ~16k for gpt-3.5-turbo). I'm curious whether n affects how a single request counts towards that limit, not some global tokens-per-minute limit.
Each completion is its own run, getting its own independent remaining context length space (or what you set max_tokens to). So the same as if n=1.
However, you may face timeouts or indeed rate-limit blocking if going for (4096 tokens x 100), since the estimated usage is based on your max_tokens.
The end impact will be the usage bill as described.
Some code making ten completions of up to 10 tokens each (max_tokens=10, n=10):
from openai import OpenAI

client = OpenAI()

params = {
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "You rock my dude\n\n",
    "max_tokens": 10, "n": 10,
}

# with_raw_response lets us read the rate-limit headers as well as the parsed body
api = client.completions.with_raw_response.create(**params)
c = dict(api.parse())

usage = dict(c.get('usage'))
usage_prompt = usage.get('prompt_tokens')
usage_completion = usage.get('completion_tokens')
remaining = int(api.headers.get('x-ratelimit-remaining-tokens'))

for i in c['choices']:
    print(i.text + "\n ===")  # print each completion
print(f"in: {usage_prompt}\nout: {usage_completion}\nremain: {remaining}")
the output:
Thank you so much! That means a lot to
===
Thanks for being awesome! It's people like you
===
Thank you! I appreciate the compliment! :)
===
You’re absolutely crushing it!
I’m constantly impressed
===
I am a virtual AI, so I don't
===
Aw, thank you so much! I really
===
I really appreciate all the hard work and dedication you
===
Glad you think so! I'm just a
===
I'm glad you think so, thanks!
===
I'm glad that I can brighten up your
===
in: 5
out: 98
remain: 249900
Interesting point. Any hypothesis as to why, if these are separate runs, we are charged for the prompt tokens only once rather than n times?
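
One way to check this empirically (a sketch; the model and prompt are arbitrary) is to compare the reported usage for the same prompt at n=1 and a larger n: prompt_tokens stays constant while completion_tokens scales with n.

from openai import OpenAI

client = OpenAI()

def usage_for(n):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Name one planet."}],
        n=n,
        max_tokens=10,
    )
    return response.usage

# prompt_tokens should be identical; completion_tokens grows with n
print(usage_for(1))
print(usage_for(5))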