Does the max_tokens parameter of the completions endpoint apply to ALL responses or to EACH response?

The max_tokens parameter of the completions endpoint determines the maximum number of tokens the model is supposed to generate, but it’s unclear to me whether this limit applies to all responses combined or to each response. For example, if max_tokens is 10 and the number of responses n is 2, can the model generate up to 10 tokens for each of the two responses, or only 10 tokens across both? After some experimentation, it seems that max_tokens limits the number of tokens for each response.

For example, if you execute the following code, it will print 5 (the value of max_tokens), and we get two responses, each with that number of tokens:

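import openai
import tokenizers  # Hugging Face package

# Legacy Completions API (openai<1.0); an API key must already be configured
completions = openai.Completion.create(model="text-ada-001",
                                       prompt="tell me about quantum physics",
                                       temperature=1,
                                       max_tokens=5,
                                       n=2)

tokenizer = tokenizers.Tokenizer.from_pretrained("gpt2")
# Count the tokens in each of the n returned completions
# (e.g. one returned text was "\n\nQuantum physics")
for choice in completions.choices:
    print("Length of each response: ", len(tokenizer.encode(choice.text).tokens))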
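
A way to cross-check this without re-tokenizing is to look at the fields the response itself carries. The sketch below assumes the legacy Completions response shape: a choices list whose entries have text and finish_reason, plus a usage object with completion_tokens that I believe counts generated tokens across all choices. The sample dict is hand-written for illustration, not real API output, and summarize is a hypothetical helper name.

```python
def summarize(response, max_tokens):
    """Summarize how a completions-style response relates to max_tokens."""
    choices = response["choices"]
    # finish_reason == "length" means that choice was cut off by max_tokens
    truncated = [c["finish_reason"] == "length" for c in choices]
    total = response["usage"]["completion_tokens"]
    return {
        "n": len(choices),
        "truncated": truncated,
        "total_completion_tokens": total,
        # If the limit were shared by all choices, total could not exceed max_tokens
        "exceeds_shared_limit": total > max_tokens,
    }


# Hand-written sample, NOT real API output (assumed response shape)
sample = {
    "choices": [
        {"text": "\n\nQuantum physics", "finish_reason": "length"},
        {"text": "<second completion>", "finish_reason": "length"},
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15},
}

summary = summarize(sample, max_tokens=5)
print(summary)
```

If a real response shows every choice ending with finish_reason "length" and a completion token total above max_tokens, that is consistent with the limit applying to each response separately.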

Please, I’d like an answer from an official OpenAI employee or developer, not guesses. I am already doing more than guessing here, based on my experiments.

I encourage you to update the docs (OpenAI API), which are ambiguous, hence my question.


Each API call

HTH

:slight_smile:

I am not sure what you mean because, from my experiments, that’s not correct. max_tokens seems to limit the length of each response, not “each API call”, whatever that means.

Each API call means the response, actually (completion + prompt).

That is why I called it “each API call”

But the response contains a lot more than what the model generated. So your answer doesn’t make any sense to me, and it doesn’t seem to be consistent with the OpenAI API documentation, which describes max_tokens as “The maximum number of tokens to generate in the completion.” The completion is not the HTTP response! Even if by “response” you mean the model’s completion, that doesn’t answer my question. It seems that you didn’t even read my question.

:slight_smile:

How is that useful? I am not sure whether you’re a bot or not, but honestly, you’re not being helpful.

I had the same question, posed on this forum under the headline “Do I need to increase max_tokens when using n>1 e.g. n=3 for generating multiple chat completions” (I wasn’t allowed to post the link in this comment).

Based on my own testing and the comments on my thread, it seems that the max_tokens parameter applies to EACH of the n generations you request. It will not try to fit ALL completions into your max_tokens limit. I was using gpt-3.5-turbo, but I assume the same applies to other completion models.
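
If max_tokens does apply per generation, as the testing above suggests, then there is no need to raise it when n > 1, but the upper bound on generated tokens for one request grows with n. A minimal sketch of that arithmetic follows; worst_case_completion_tokens is a hypothetical helper, and the per-generation behaviour it encodes is the assumption under discussion, not something confirmed by official documentation in this thread.

```python
def worst_case_completion_tokens(max_tokens: int, n: int = 1) -> int:
    """Upper bound on generated tokens for one request.

    Assumes max_tokens limits EACH of the n generations separately.
    """
    if max_tokens < 0 or n < 1:
        raise ValueError("max_tokens must be >= 0 and n must be >= 1")
    return max_tokens * n


# With max_tokens=5 and n=2, up to 5 tokens per generation, 10 in total
bound = worst_case_completion_tokens(max_tokens=5, n=2)
print(bound)
```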

Hope this helps
