n specifies how many chat completion choices to generate for each input message.
How does it work?
Does the prompt get sent n times? For long prompts, that could eat into the token rate limit. Or does it work in some optimized way (i.e., send the prompt once and generate n completions)?
You only "send" the prompt once; the only change to the API call is the value of `n`, and you are only charged for the extra output tokens of the additional completions.
While the technical details aren't public, the most logical implementation is to load the input into the model's context once, then capture, reset, and repeat output generation from the point in the context where the response begins.
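To make the "sent once" point concrete, here is a minimal sketch of the request body as it would go over the wire. The model name and message content are placeholders; the key observation is that the prompt appears once regardless of `n`.

```python
# Sketch of a chat completions request body.
# The prompt appears exactly once; only the value of "n" changes.
request_n1 = {
    "model": "gpt-4",  # placeholder model name
    "messages": [{"role": "user", "content": "Name a color."}],
    "n": 1,
}

# Asking for 10 completions is the same payload with n=10 --
# the messages (and hence the prompt tokens) are not duplicated.
request_n10 = {**request_n1, "n": 10}

assert request_n1["messages"] == request_n10["messages"]
print(request_n10["n"])  # 10
```

The response then contains one entry per completion in its `choices` list, each with its own `index`.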
Do you know if the n replies are generated in parallel? From my experiments it seems so, since the response time for n=10 is only slightly higher than for n=1 (56 s vs. 45 s in one experiment).
It seems that, as of now (Aug 16), when using the GPT-4 model with n=N, the prompt tokens count toward the tokens-per-minute limit N times, even though the context is not actually processed N times on the backend.
Agreed; today, for me, RateLimitError appears to compute the token count as prompt length × n (I get this error when setting a large n, e.g. n=100). But if I submit a smaller value (say n=10) and inspect the chat.completion object's `usage` field, `prompt_tokens` and `total_tokens` are computed as expected (as described in the accepted answer).
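The discrepancy described above can be sketched as simple arithmetic. Both functions below are hypothetical, written only to match these observations: billing (`usage`) counts the prompt once, while the rate limiter appears to count it n times.

```python
# Hypothetical accounting that matches the observations in this thread.

def billed_tokens(prompt_tokens: int, completion_tokens_each: int, n: int) -> int:
    # What usage.total_tokens reports: one prompt plus n completions.
    return prompt_tokens + n * completion_tokens_each

def rate_limit_count(prompt_tokens: int, n: int) -> int:
    # What the per-minute rate limiter appears to charge:
    # the prompt counted n times (completion tokens ignored here).
    return prompt_tokens * n

# Example: a 1,000-token prompt, ~100-token replies, n=10.
print(billed_tokens(1000, 100, 10))   # 2000
print(rate_limit_count(1000, 10))     # 10000
```

So a request that bills only 2,000 tokens can still consume 10,000 tokens of the per-minute quota, which is why a large n (e.g. n=100) trips RateLimitError even for modest prompts.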