Clarification for max_tokens

My interpretation of `max_tokens` is that it specifies the upper bound on the length of the generated completion.

However, the documentation is confusing. I am referring to the official OpenAI API documentation:

> The maximum number of tokens to generate in the completion.
>
> The token count of your prompt plus `max_tokens` cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096).

So at first the documentation mentions the maximum number of tokens to generate in the completion. But then it states that the token count of the prompt plus the completion must stay under 4000. I mention 4000 because it is the maximum token limit for the davinci model.

So what is it?

  1. Is it the maximum number of tokens that will be generated during the completion?
  2. Or is it that the token count of the prompt + `completion` must be < 4000?

I’m going with @overbeck.christopher here and staying on the conservative side. The wording, as original poster @nashid.noor pointed out, remains confusing and leaves me questioning:

Should I add my chosen `max_tokens` to the token count of my prompt and make sure the sum is no larger than the limit of the model I’m using?

These would be different numbers, but again I’ll work with the conservative approach for now.
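The conservative approach above can be sketched as a simple budget check: treat `max_tokens` as a reservation that must fit alongside the prompt within the model's context length. The context sizes in this table are illustrative figures, not an authoritative list, and the function name is my own:

```python
# Illustrative context lengths (assumption: commonly cited figures, verify
# against the current model documentation before relying on them).
CONTEXT_LENGTHS = {
    "text-davinci-003": 4097,
    "gpt-3.5-turbo": 4096,
}

def fits_in_context(model: str, prompt_tokens: int, max_tokens: int) -> bool:
    """True if the prompt plus the max_tokens reservation fits the context."""
    return prompt_tokens + max_tokens <= CONTEXT_LENGTHS[model]

# A 62-token prompt with max_tokens=4000 fits a 4096-token context,
# but max_tokens=4095 would overflow it.
```

In other words: yes, under this reading you add your set `max_tokens` to the prompt's token count and keep the sum within the model's limit.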

Hi @nashid.noor, @overbeck.christopher and @kathyh

Every model has a context length. It cannot be exceeded.

As I shared above, `max_tokens` only specifies the maximum number of tokens to generate in the completion; it is not necessarily the amount that will actually be generated.

However, if the sum of tokens in the prompt plus `max_tokens` exceeds the context length of the model, the request will be considered invalid and you’ll get a 400 error:


> This model's maximum context length is 4096 tokens. However, you requested 4157 tokens (62 in the messages, 4095 in the completion). Please reduce the length of the messages or completion.
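The arithmetic in that error message also tells you the largest valid `max_tokens` for a given prompt. A minimal sketch (the function name is mine, not part of the API):

```python
def max_completion_budget(context_length: int, prompt_tokens: int) -> int:
    """Largest max_tokens value that will not trigger the 400 error above."""
    return context_length - prompt_tokens

# For the error above: with 62 tokens in the messages and a 4096-token
# context, anything up to 4096 - 62 = 4034 would have been accepted.
```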


Thank you for answering my question about how this plays out with each model’s context length. It is super helpful to understand the order of operations and when I could actually hit an error from mishandling these values (my brain works backward from seeing how these terms behave, I guess).


Another point of confusion: `max_tokens` reportedly defaults to 16. Has anyone confirmed this? I haven’t used the API, but completions on the ChatGPT website can be longer than 16 tokens.

That default only applies to the completions endpoints, which makes setting the `max_tokens` value essentially required there.

For the chat completions endpoint, you can simply not specify a `max_tokens` value, and then all the remaining context space not used by the input can be used for forming a response, without needing careful, tedious token-counting calculations to try to get close.

Reminder: `max_tokens` is a reservation of part of the model’s context length exclusively for forming your answer, and it also sets a limit on how much comes back.

> max_tokens only specifies the max number of tokens to generate in the completion

Can you explain what “max number of tokens to generate in the completion” means? What does “completion” mean here? Does it mean the response generated by the LLM?

Yes, a completion is the response from the LLM. The word “completion” comes from the original models that would return the most probable completion text for your input text. So basically, the autocompletion.

Here is `max_tokens` set to 6, with a completion model. It does a very advanced version of writing what comes next:


The colored text is the AI’s six tokens of completion output after my un-highlighted writing prompt.

Because this “completion” is so talented and versatile, we can give other writing formats for it to complete:


With only six tokens we didn’t get much text; however, I can have it “complete” what it was writing again for another six tokens:

> I like the name “Ava.” It is simple,

So the max_tokens value will cut off the AI’s output if you don’t set it large enough. The AI doesn’t know what this setting is.
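You can detect this cut-off in the response itself: when `max_tokens` truncates the output, the API reports a `finish_reason` of `"length"` instead of `"stop"`. A minimal sketch (the helper name and the literal response dict are mine, for illustration):

```python
def was_truncated(choice: dict) -> bool:
    """True if the completion stopped because it hit the max_tokens limit."""
    return choice.get("finish_reason") == "length"

# The six-token example above would come back cut off mid-sentence:
choice = {"text": ' I like the name "Ava." It is simple,',
          "finish_reason": "length"}
```

Checking `finish_reason` is the reliable way to tell a deliberate stop from a truncation, since the AI itself does not know the setting exists.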

The chat endpoint puts all the “human” and “AI” banter into containers, and the model has been trained to perform in a conversation setting.