What is the maximum response length (output tokens) for each GPT model?

On https://platform.openai.com/docs/models the following maximum response lengths are provided:

  • gpt-4-1106-preview (GPT4-Turbo): 4096
  • gpt-4-vision-preview (GPT4-Turbo Vision): 4096
  • gpt-3.5-turbo-1106 (GPT3.5-Turbo): 4096

However I cannot find any limitations for the older models, in particular GPT3.5-16k and the GPT4 models. What are their maximum response lengths? Is there any official documentation of their limits?


Based on the available slider range in the playground, GPT3.5-16k allows for 16384 output tokens and GPT4 for 8192 tokens. But I would prefer an official statement …

What about GPT4_32k? It’s not in the playground for me.

Bump!
Just found out about this new restriction. I am curious what the output token limits are for the other models.


On top of the max output tokens, it would be great if there were some way to specify a min_tokens. Even giving guidance around a minimum word or character count is inconsistent at best and ignored at worst.


Some new findings on this:

The available sources of documentation are inconsistent

  • The models documentation mentions 4096 output tokens for many models.

  • The playground provides a maximum of 4095 output tokens.

  • When passing a max_tokens value that is too large to the API, the error message mentions a maximum context length of 4097 tokens:

    $ curl https://api.openai.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $OPENAI_API_KEY" \
        -d '{
        "model": "gpt-3.5-turbo-0613",
        "messages": [
          {
            "role": "system",
            "content": "hi"
          }
        ],
        "max_tokens": 4097
      }' 
    {
      "error": {
        "message": "This model's maximum context length is 4097 tokens. However, you requested 4105 tokens (8 in the messages, 4097 in the completion). Please reduce the length of the messages or completion.",
        "type": "invalid_request_error",
        "param": "messages",
        "code": "context_length_exceeded"
      }
    }
    

The maximum response length depends on the size of the input

Contrary to what I assumed, the prompt and the completion share the same token limit. As the playground puts it:

The maximum number of tokens to generate shared between the prompt and completion. The exact limit varies by model.
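To make that concrete, here is a minimal sketch (assuming the tiktoken library and the 4097-token window from the error above) of how one might estimate the remaining completion budget for a given prompt; the 7-token chat overhead is inferred from the "8 in the messages" part of the error and is only an approximation:

    # Minimal sketch: estimate the completion budget left for gpt-3.5-turbo-0613,
    # assuming a shared 4097-token context window (taken from the error above).
    import tiktoken

    CONTEXT_WINDOW = 4097  # reported by the context_length_exceeded error

    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-0613")
    prompt = "hi"
    # The error counts 8 tokens for a single "hi" message, so assume ~7 tokens
    # of chat formatting overhead on top of the raw content tokens.
    prompt_tokens = len(encoding.encode(prompt)) + 7

    max_completion_tokens = CONTEXT_WINDOW - prompt_tokens
    print(f"Roughly {max_completion_tokens} tokens left for the completion")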

Trying to identify the maximum response length empirically

So far, I have attempted two approaches for this without success:

  1. Use logit_bias to enforce a single token everywhere (e.g., {'70540': 100}; see the sketch after this list): Unfortunately, the output was always truncated after 100 tokens for me when using the gpt-3.5-turbo models (even when setting max_tokens to a higher value).
  2. Provide a random input and set the temperature to 2: While this often leads to “infinite” outputs as the model generates implausible text without a good stopping point, I was unable to hit the full response length of 4096 tokens with this.
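For anyone who wants to reproduce approach 1, here is a hedged sketch against the API using the openai Python package (v1 client); the token id 70540 is taken from the post above, and the max_tokens value assumes the 4097-token window minus the 8 prompt tokens from the earlier error:

    # Sketch of the logit_bias experiment: strongly favour one token so the model
    # has no natural stopping point, then check where the output gets cut off.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "system", "content": "hi"}],
        max_tokens=4089,            # 4097-token window minus the 8 prompt tokens
        logit_bias={"70540": 100},  # bias one token id as high as the API allows
    )

    print(response.usage.completion_tokens, response.choices[0].finish_reason)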

What did you mean by “random” input?

I had a similar idea and ran similar experiments, but very recently, with a different model (GPT-4o), different memory names and different limits. Same question, though. There are also the size mysteries of the new file upload tool with respect to what the model can actually process from its environment back into the conversation memory (clear statements are obscure or hard to find, especially for the web UI).

So it came down to empirical “truth”. The bots would send me on goose chases, assuring me with the utmost confidence that I would succeed, while basing much of their advice on vague and wrong assumptions they never shared. Then began the long quest to stop assuming the model understood just because its words, read as human words, made it look like it did. The only reality of its environment I could verify independently was the text files themselves.

what do chat bot words really mean…

Now some context for my understanding:

  1. I am talking about the ChatGPT web UI, assuming it uses the same memory mechanics (a clear diagram of which is hard to find anywhere; lots of words, not all using the same terminology, and the dates on help pages only mean that something changed, not that the whole page was updated).

  2. It is now October 2024, and the specifications I can find relevant to my curiosity are those of GPT-4o:
    Models - OpenAI API

The language there is “context window” = 128K and “max output tokens” = 16384.

I mention that because these limits seem to have changed, particularly compared to the models’ outdated training corpus. That makes for a good discrepancy check: the system prompt does not keep the bots’ self-awareness of their own model class and abilities up to date.

They have to be artfully babysat, and the less they know to start with, the longer the conversation needed to weed out all the wrong lurking assumptions, now with huge capacities involved. So I resorted to figuring it out the same way you did, except I asked for output with correct syntax but no semantics, so that all the words would still be tokenizable. Maybe that is obvious.

My version of your idea:

  1. I got it to generate outputs of about 3000 tokens, but that may have been around the time of a GPT-4o snapshot change; before that it was 40xx tokens. I was wondering whether one could use a set of graduated chunk sizes to narrow down the upper and lower bounds.

  2. The idea to improve on yours was to concatenate files of known sizes, upload them, and then ask the model to stitch them back together in context memory in various combinations, effectively binary-searching while knowing what to expect: a single-line prompt asking it to output various combinations of those input sizes (from context memory) as the response text (see the API sketch after this list).
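For the API (rather than the web UI), the binary-search idea can be sketched roughly like this; the model name, the search bounds, and the “Reply with just OK.” trick are my assumptions, and depending on the model the value you converge on may reflect the shared context window rather than a separate max-output cap:

    # Sketch: binary-search the largest max_tokens the API will accept,
    # using the 400 error (e.g. context_length_exceeded) as the oracle.
    from openai import OpenAI, BadRequestError

    client = OpenAI()

    def accepts(max_tokens: int, model: str = "gpt-4o") -> bool:
        """True if the API validates this max_tokens for a tiny prompt."""
        try:
            client.chat.completions.create(
                model=model,
                # Asking for a short reply keeps the cost low: max_tokens is
                # validated before any tokens are generated.
                messages=[{"role": "user", "content": "Reply with just OK."}],
                max_tokens=max_tokens,
                temperature=0,
            )
            return True
        except BadRequestError:
            return False

    lo, hi = 1, 200_000  # bracket assumed to contain the unknown limit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if accepts(mid):
            lo = mid      # mid was accepted, so the limit is at least mid
        else:
            hi = mid - 1  # mid was rejected, so the limit is below mid

    print(f"Largest accepted max_tokens: {lo}")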

Parent context (why I went down this reality-check rabbit hole):
This was proposed to me in one conversation as a means of dealing with long text documents as the substrate for my text transformation projects (translation).

Plea:
Please fix my words if they do not make sense to this community or do not read as the most fluent shop talk. I am only a month into this intensive struggle to find out what this “semantics” they keep talking about really is, and this very simple reality check keeps nagging me. So much obscurity, and not just from the bots.

Recap of the extended empirical method idea:
So the solution is to work from under the limit and keep going up, but you can accelerate this by using the context memory and asking the model to transfer a previous under-the-limit response from the file upload mechanism back into the conversation (it does not even have to be the same conversation, if you want to avoid running into aging and shrinking conversation lengths).

Off-shoot question (given new interface features that did not exist at the time of the original post):
I also wonder about the same question for file uploads and their transfer into context memory (the 128K-token context window), a.k.a. conversation history memory or context, not to be confused with single-turn memory, which, I agree with your assessment, seems to share the “max output tokens” between user prompt and bot response.

I’m not sure whether you meant to post this to a human or to ChatGPT, but for a human, your post is hard to read 🙂

I meant random as in random characters, like from /dev/urandom. It was really just a naive way to minimize the likelihood of the model generating an end token. Also, I was only referring to the API. ChatGPT is a separate product that uses different versions of the models, and how it assembles the model’s context or processes files can only be reverse-engineered with limited insight. Maybe someone else can better help you with that.
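In case it helps, this is roughly what that experiment looks like as code (a sketch with assumed sizes; the noise length and max_tokens just need to fit together inside the 4097-token window):

    # Sketch of the random-input experiment: feed the model noise so an end
    # token becomes unlikely, crank up the temperature, and see how many
    # completion tokens it produces before being cut off.
    import secrets
    from openai import OpenAI

    client = OpenAI()

    noise = secrets.token_hex(512)  # stand-in for /dev/urandom-style input

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": noise}],
        temperature=2,    # maximum allowed temperature
        max_tokens=3000,  # leaves room for the noisy prompt in the shared window
    )

    print(response.usage.completion_tokens, response.choices[0].finish_reason)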