Do I need to increase max_tokens when using n=3 to generate multiple chat completions? For example, if I want to generate 3 possible completions, each with a limit of 15 tokens, do I need to set max_tokens=45 instead of max_tokens=15?
p.s. I’m using gpt-3.5-turbo chat completions.
What are you finding when you test those scenarios you mentioned?
I don’t know about multiple completions, but when I’m playing around with it, if I take the token count too high my bots get lost somewhere near their original token limit and hallucinate a lot. The best case I’ve gotten was the bot repeating the last line it got stuck on.
Hey, hey, be careful using “n=X”, it will make your cost X times more!
Welcome to the OpenAI community @ehutt
When you pass n>1, you’ll get "n" completions, each individually following the specified max_tokens limit, so you don’t need to multiply max_tokens by n.
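As a sketch, the request from the original question might look like this with the openai Python SDK (v1.x style); the parameter names match the API reference, but treat the prompt and the commented-out call as illustrative:

```python
# Sketch of the request from the question above. max_tokens applies
# PER completion, so 15 here limits each of the 3 choices, not all combined.
request = dict(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name a yellow fruit."}],
    n=3,            # three independent completions come back in response.choices
    max_tokens=15,  # per-completion limit, not 45
)

# from openai import OpenAI
# response = OpenAI().chat.completions.create(**request)
# for choice in response.choices:
#     print(choice.index, choice.message.content)
```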
It doesn’t make your cost “X times more” to use N. It depends on whether your cost comes mainly from input or output tokens.
The best_of and n parameters may also impact costs. Because these parameters generate multiple completions per prompt, they act as multipliers on the number of tokens returned.
Your request may use up to num_tokens(prompt) + max_tokens * max(n, best_of) tokens, which will be billed at the per-engine rates outlined at the top of this page.
In the simplest case, if your prompt contains 10 tokens and you request a single 90 token completion from the davinci engine, your request will use 100 tokens and will cost $0.002.
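The upper bound quoted from the docs can be written as a small helper (a sketch; the function name is made up):

```python
def max_billed_tokens(prompt_tokens: int, max_tokens: int,
                      n: int = 1, best_of: int = 1) -> int:
    """Upper bound on billed tokens, per the formula quoted above:
    num_tokens(prompt) + max_tokens * max(n, best_of)."""
    return prompt_tokens + max_tokens * max(n, best_of)

# The docs' example: 10-token prompt, one 90-token completion -> 100 tokens.
print(max_billed_tokens(10, 90))       # 100
# The original question: 3 completions of up to 15 tokens each.
print(max_billed_tokens(10, 15, n=3))  # 55
```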
So if you have a prompt like “Scan this 15000-token book chapter, count, and return a response NumberOfSpellingErrors: X and NumberOfGrammarErrors: Y”, it would almost be silly not to request n=9 to ensure fluke-free, reliable results, which would be far cheaper than asking again even once, since the long prompt is only billed once per request.
You can even do some trickery with baseline (non-chat) completion models, like requesting best_of=8 with n=3, where best_of generates eight completions server-side, keeps the three with the best logprobs for the whole answer, and discards the five lowest.
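That trick only works on the legacy completions endpoint, where best_of is a supported parameter (it isn’t on chat completions). A sketch, with the model name and prompt as illustrative assumptions:

```python
# Sketch of the best_of / n trick on the legacy completions endpoint.
# The server generates best_of candidates, ranks them by total logprob,
# and returns the top n. Note: you are billed for ALL best_of candidates.
request = dict(
    model="gpt-3.5-turbo-instruct",  # an example base completion model
    prompt="Too many questions gives me a:",
    max_tokens=15,
    best_of=8,  # generate 8 candidates server-side
    n=3,        # return only the 3 with the highest total logprob
)

# from openai import OpenAI
# response = OpenAI().completions.create(**request)
```

The API requires best_of >= n, since you can’t return more candidates than were generated.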
Hi, thanks for the clarification. How do you think they might have implemented the “best” detection? I mean, how do they determine which completion is the best?
You give them all back to the AI and ask it.
How is “best of” determined?
When the AI is completing an answer, it does so one token (a word fragment) at a time. The next token is chosen by a weighting over probabilities the model computes from its massive trained knowledge and the prompt input.
For example, given “Roll a six-sided die and give me the result.”, the highest-probability token is “4”. Or “One example of a yellow fruit is a:”, and you’re going to get banana.
Producing exactly the same deterministic output for a given input every time is not very creative, so a bit of randomness is introduced that can choose lower-probability tokens, controlled by the softmax temperature. The completion can then go in a different direction, as the model finishes the sentence (and everything after it) based on the alternate word choice.
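The temperature mechanism described above can be sketched in a few lines (toy logits, not real model output; the function is illustrative):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random.random):
    """Sketch of temperature sampling: divide logits by T, softmax,
    then draw from the resulting distribution. Low T sharpens toward
    the top token; high T flattens the distribution, letting
    lower-probability tokens through."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):            # inverse-CDF draw
        acc += p
        if r < acc:
            return i, probs
    return len(probs) - 1, probs

# Toy logits for ["headache", "sense", ...]; at near-zero temperature
# the top token is picked essentially every time.
idx, probs = sample_with_temperature([5.0, 0.5, 0.1], temperature=0.01)
print(idx)  # 0
```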
A total score can be assigned to the entire output from its tokens and their probabilities, to see how likely the output as a whole was. An output with several divergences, where the AI had to adapt in new ways, then scores lower than one that follows the charted path and has no language awkwardness.
I was going to give you an example within the playground of strict completion using the most likely tokens versus a string made of the 5th-likeliest tokens (where instead of the next word being “headache” we continue with “sense”), but I got an awkward end-of-text token as the 5th-likeliest very soon. Playing by the rules, the output would then be done:
Too many questions gives me a:
most likely -- 5th most likely
headache 92.6% -- sense 0.37%
\n 65.57% -- <|endoftext|> 0.01%
\n 99.97% --
\n 79.59% --
From the percentages, you can still see the short second column would have a lower total score when the response is evaluated.
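As a sketch of how those per-token percentages roll up into a total score, here is a sum of log-probabilities (an assumption about the exact scoring, but it matches how best_of ranks candidates by logprob):

```python
import math

# Per-token probabilities from the example above.
top_path = [0.926, 0.6557, 0.9997, 0.7959]  # headache, \n, \n, \n
alt_path = [0.0037, 0.0001]                 # sense, <|endoftext|>

def total_logprob(probs):
    """Sum of log-probabilities; a higher (less negative) total means
    a more likely overall output."""
    return sum(math.log(p) for p in probs)

print(round(total_logprob(top_path), 2))  # -0.73
print(round(total_logprob(alt_path), 2))  # -14.81
```

The second column scores far lower, as the percentages suggest.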