What is the maximum response length (output tokens) for each GPT model?

On https://platform.openai.com/docs/models the following maximum response lengths are provided:

  • gpt-4-1106-preview (GPT4-Turbo): 4096
  • gpt-4-vision-preview (GPT4-Turbo Vision): 4096
  • gpt-3.5-turbo-1106 (GPT3.5-Turbo): 4096

However I cannot find any limitations for the older models, in particular GPT3.5-16k and the GPT4 models. What are their maximum response lengths? Is there any official documentation of their limits?

Based on the available slider range in the playground, GPT3.5-16k allows for 16384 output tokens and GPT4 for 8192 tokens. But I would prefer an official statement …

What about GPT4_32k? It’s not in the playground for me.

Bump!
Just found out about this new restriction. I am curious what the output token limits are for the other models.

The max output tokens limit would be more useful if there were also some way to specify a min_tokens. Even giving the model guidance on a minimum word or character count is inconsistent at best and ignored at worst.
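
There is no min_tokens parameter in the API; a rough workaround is to measure the reply and ask the model to expand when it comes back too short. The sketch below is only an illustration (the 200-token threshold, the model name, and the follow-up prompt are arbitrary placeholders, and it assumes the openai 1.x Python client plus the tiktoken package):

    import tiktoken
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    MIN_TOKENS = 200  # desired soft minimum (placeholder value)

    messages = [{"role": "user", "content": "Explain token limits in detail."}]
    for _ in range(3):  # give up after a few attempts
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages
        ).choices[0].message.content
        if len(enc.encode(reply)) >= MIN_TOKENS:
            break
        # Too short: keep the reply in context and ask for a longer expansion.
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "Please expand on this in much more detail."},
        ]
    print(reply)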

Some new findings on this:

The available sources of documentation are inconsistent

  • The models documentation mentions 4096 output tokens for many models.

  • The playground provides a maximum of 4095 output tokens.

  • When passing a max_tokens value that is too large to the API, the error message mentions a limit of 4097 tokens:

    $ curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-3.5-turbo-0613",
        "messages": [
          {
            "role": "system",
            "content": "hi"
          }
        ],
        "max_tokens": 4097
      }'
    {
      "error": {
        "message": "This model's maximum context length is 4097 tokens. However, you requested 4105 tokens (8 in the messages, 4097 in the completion). Please reduce the length of the messages or completion.",
        "type": "invalid_request_error",
        "param": "messages",
        "code": "context_length_exceeded"
      }
    }
    

The maximum response length depends on the size of the input

Contrary to what I assumed, the prompt and the completion share the same token limit. As the playground's description of max_tokens puts it:

The maximum number of tokens to generate shared between the prompt and completion. The exact limit varies by model.
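
In other words, whatever the prompt consumes is subtracted from the budget available for the completion. Here is a rough sketch of how one might estimate the remaining completion budget (my own illustration, not from the docs: the 4097-token limit is taken from the error message above, the per-message overhead is an assumption, and it relies on the tiktoken package):

    import tiktoken

    CONTEXT_LIMIT = 4097       # limit reported by the API error message above
    MESSAGE_OVERHEAD = 4       # rough per-message token overhead (an assumption)

    def remaining_completion_budget(messages, model="gpt-3.5-turbo-0613"):
        enc = tiktoken.encoding_for_model(model)
        prompt_tokens = sum(
            MESSAGE_OVERHEAD + len(enc.encode(m["content"])) for m in messages
        )
        # Whatever the prompt uses is no longer available to the completion.
        return CONTEXT_LIMIT - prompt_tokens

    print(remaining_completion_budget([{"role": "system", "content": "hi"}]))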

Trying to identify the maximum response length empirically

So far, I have attempted two approaches for this without success:

  1. Use logit_bias to enforce a single token everywhere (e.g., {'70540': 100}): Unfortunately, the output was always truncated after 100 tokens for me when using the gpt-3.5-turbo models (even when setting max_tokens to a higher value). A sketch of this call follows after the list.
  2. Provide a random input and set the temperature to 2: While this often leads to “infinite” outputs as the model generates implausible text without a good stopping point, I was unable to hit the full response length of 4096 tokens with this.
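
For reference, here is a minimal sketch of the first approach (assuming the openai 1.x Python client; the token id 70540 and the bias of 100 are the values from the list above):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "system", "content": "hi"}],
        logit_bias={"70540": 100},  # strongly favor a single token everywhere
        max_tokens=4096,
    )
    # Inspect how many completion tokens were produced and why generation stopped.
    print(response.usage.completion_tokens)
    print(response.choices[0].finish_reason)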