Not allowed to have all 8192 tokens

I’m getting this error in the gpt-4 api during a chat completion:

“This model’s maximum context length is 8192 tokens. However, you requested 8192 tokens (4505 in the messages, 3687 in the completion). Please reduce the length of the messages or completion.”

Of course I can just request fewer tokens, but what is the real maximum? 8191? 8000?


I’m wondering if the API is really telling you that your completion was going to go over the token limit, and it had to truncate the response to max_tokens instead.

Did you get a truncated response back?

This is a little fluke I’ve also seen. There is an extra token consumed or reserved, perhaps an unseen stop token.

You can also just not specify a max_tokens API value which will always let you use all remaining context length for response creation.
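For example, here is a minimal sketch with the v1 OpenAI Python SDK, leaving max_tokens out entirely; the model name and prompt are just placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    # no max_tokens: the reply may use whatever context space the prompt left free
)

print(response.choices[0].message.content)
print(response.usage)  # prompt_tokens, completion_tokens, total_tokens
```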

I know there is the theory of the elusive stop token out there. But I haven’t seen any proof of it. To prove that theory, send an 8191-token input and ask the model to respond with a 0 or 1 (1 token).

If that works, then the model here is 8192 max tokens, and the error message the OP sees is really a reflection of response truncation, since the model cannot “see” past 8192 tokens.

For example, it won’t say, “I was going to respond with 9012 tokens, but decided not to because of the token limit.”

So my guess is that the 8192-token limit is hit, the API spits out the message since it didn’t see the stop flag, and reports the baffling 8192 max-token error. It’s simply a misleading message.

Maybe someone can prove this.
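For anyone who wants to actually run that experiment, here is a rough sketch. The chat format adds a few overhead tokens per message, so the filler length is an assumption you would tune until usage.prompt_tokens comes back as exactly 8191:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pad the prompt toward 8191 tokens; " word" encodes to a single cl100k_base
# token, but the chat wrapper adds overhead, so adjust this count until
# usage.prompt_tokens reads exactly 8191.
filler = " word" * 8170

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": filler + "\nReply with 0 or 1 only."}],
    max_tokens=1,
)

print(resp.usage.prompt_tokens, resp.usage.completion_tokens)
print(resp.choices[0].message.content)
```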

This is what ChatGPT has produced before when it had a bug revealing its context length and reservation:

This model’s maximum context length is 8193 tokens, however you requested 8689 tokens (7153 in your prompt; 1536 for the completion). Please reduce your prompt; or completion length.)

So there is a disparity between the current API endpoint calculations and that of the model size. Perhaps an allowance was made so people aren’t charged for 2 tokens when the second is the recognized stop sequence that produces a finish but is not seen (and not documented for chat API).

  1. Get the AI to produce an im_end token, which ends the output, and see whether you are charged for zero completion tokens when zero tokens come back, even though you obviously told the AI what to produce;
  2. Don’t specify max_tokens, and simply get the maximum completion available all the time.

I think the disparity lies in the definitions, and hence the confusion …

For example, I am fully aware of im_end being a real token in the tokenizer. It’s usually the highest valued integer in the tokenizer output, and one “hack” to get maximum output length is to use logit_bias to suppress this token … so it’s a real thing!
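You can check the "special tokens are real tokens" part with tiktoken. The snippet below uses <|endoftext|> because tiktoken exposes it directly; the chat-format <|im_end|> token is not in tiktoken’s public special-token table, so treat its exact ID as unverified:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer behind gpt-3.5-turbo / gpt-4

# Special tokens really are single tokens with very high integer IDs.
print(enc.encode("<|endoftext|>", allowed_special="all"))  # -> [100257]

# The classic "maximize output" hack on the older completion models
# (p50k_base tokenizer, where <|endoftext|> is token 50256) was to pass
#     logit_bias={"50256": -100}
# so the model almost never picks its end token. Whether the chat endpoint
# honors a bias against its own end-of-turn token is something I have not
# verified, so treat that part as an assumption.
```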

But these special tokens are what I would refer to as “internal system tokens”, whereas in the API docs, I assume they are referring to usable “user facing tokens”.

So my assumption is that when the docs say a model has a maximum of X tokens, these are ALL user tokens and do not include internal tokens like im_end.

But sometimes in the docs you see slips, where it is mentioned that a model has, for example, 4097 tokens, but I assume they literally mean the full 4096 user tokens + 1 internal token. So the user can theoretically send a 4095-token input and get a 1-token output, and everything is OK.

Going back to the OP question. I’m guessing the 8k version of GPT-4 really has 8193 tokens, 8192 user + 1 internal, and the OP hit their 8192 user allocation, and got the message back.

BUT all of this can be validated experimentally. My advice, don’t worry about it, just cut back a few tokens and call it good!


To answer my own question, 8191 tokens seems to work fine as a max_tokens.

I’m having a really hard time trying to use the API to write something complex. Over and over again, I get messages like this one:

RuntimeError
[400] This model’s maximum context length is 16385 tokens. However, you requested 22114 tokens (7114 in the messages, 15000 in the completion). Please reduce the length of the messages or completion.

I was using 3.5-turbo-16k, with max_tokens set to 15k. I don’t see how this error response can be accurate. The entire input, including JSON headers and data, was under 2k English words, and at the average ratio of roughly 0.75 words per token, that’s just not in the realm of possibility. Also, as output, I requested 10 to 20 sentences. The average English sentence is 15 to 20 words, which would put 400 English words on the high side of the output.

2000 (input English words) + 400 (output English words) = 2,400 words, which even after converting to tokens is many times smaller than what the error says.
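Rather than estimating from word counts, you can count the messages exactly with tiktoken. The per-message overhead figures below follow the pattern in OpenAI’s cookbook example and are an approximation of the chat format, so treat the exact constants as assumptions:

```python
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Approximate the prompt token count the API will report for a chat request."""
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # chat-format overhead per message (cookbook figure)
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(enc.encode(value))
            if key == "name":
                num_tokens += 1
    return num_tokens + 3  # every reply is primed with a few extra tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write 10 to 20 sentences about the data below ..."},
]
print(num_tokens_from_messages(messages))
```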

I’ve gotten this error message many times, pretty much all day today.

What’s also odd is if I reset the context window, max_tokens, to a smaller number, sometimes it works, but the results aren’t very good. I don’t see how reducing max_tokens should eliminate the error, but sometimes it does; if anything, you’d think it would help.

I’m hopeful your response will help me and others facing this problem.

The max_tokens parameter is the space reserved from the context length exclusively for forming an answer.

Setting it to 15k means that only about 1.4k of the 16k context is left to accept input.
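With the numbers from the error above, the arithmetic is easy to check:

```python
CONTEXT_LENGTH = 16385   # gpt-3.5-turbo-16k context window
prompt_tokens = 7114     # what the API counted in the messages
max_tokens = 15000       # space reserved for the reply

requested = prompt_tokens + max_tokens   # 22114
print(requested - CONTEXT_LENGTH)        # 5729 tokens over the limit
```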

You can omit the max_tokens value from the API call, and then all remaining space can be used for forming an answer (which can be dangerous if the AI gets stuck in a repeating loop).


Sorry, I’m not sure I understand. I thought the 15k max_tokens (that I had set) covered both input tokens and output tokens. The total input tokens here would be 3,500 on the very high side, probably a lot less. The output, as requested, should be on the order of less than 500 words (10 to 20 sentences). I don’t understand how 4,000 gets even close to what the error message says, namely 22k tokens (7k input, 15k output)?

Yes, I can try not using max_tokens if you think it would help.

Thanks so much for your response!!

You understood wrong.

max_tokens is the limit of the response you will get back.

max_tokens also reserves space exclusively for this response formation.

The context length of a model is first loaded with the input, and then the tokens that the AI generates are added after that, in the remaining space.

Language is formed in a transformer language model by generating, one token at a time, the next token that should logically follow, based on the previous input and the response generated so far.

(In an ideal world, there would be two parameters, a response_limit which would ensure that you don’t spend too much money, and a minimum_response_area_required to throw an error if you provided too much input to allow expected response formation. However millions of developers and lines of implemented code use the existing system.)

I don’t mean to be dense here, apologies.

OpenAI’s documentation defines Token Limits as:
“Depending on the model used, requests can use up to 4097 tokens shared between prompt and completion. If your prompt is 4000 tokens, your completion can be 97 tokens at most.”

Here is how max_tokens is defined:
“The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model’s context length.”


In other words, to me that sounds like my prompt (input) plus GPT’s response (output) together total the token limit for the model. It’s also what ChatGPT is telling me in my conversations with it.

With that in mind, at least the literature seems to say that max tokens is what you put in with your prompt plus what the model puts out via its response.

I don’t doubt the procedure you’re mentioning is how it works. Sounds like you know a lot about it.

I’m just trying to overcome an error I keep getting over and over again, where my input is a fraction of the max_tokens, and my output added to it shouldn’t come anywhere near it.

There’s gotta be something I’m missing. I’ve been getting this error regardless of how I reset the context length, and whether I do an api call to gpt-4 or even gpt-3.5-16k. It just doesn’t add up.

Have you faced this in your api calls? Any advice about it?

Yes, confusingly, in some places where OpenAI lists the models, the models’ context length is referred to as “max_tokens”. The documentation should be fixed.

This then makes the parameter for the maximum response size, also called max_tokens, less than clear. The API parameter should have been named response_size_reservation or something similar to remove ambiguity about its purpose.

What you’re saying makes perfect sense. Are you aware of my problem being a bug, or others having it? Thanks again for your feedback here

You’ve only described reserving far too much area for the response, which blocks your input.

Set max_tokens to 1500, and you’ll be allowed ~2500 tokens of input on gpt-3.5-turbo (4k), or ~14500 tokens of input on gpt-3.5-turbo-16k.
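If you want to set it dynamically, one pattern is to budget max_tokens from whatever space the prompt leaves free. This is a sketch that assumes a token-counting helper like the num_tokens_from_messages function shown earlier in the thread:

```python
CONTEXT_LENGTH = 16385        # gpt-3.5-turbo-16k; 4096 for base gpt-3.5-turbo
DESIRED_REPLY = 1500          # how long you actually want the answer to be
SAFETY_MARGIN = 50            # slack for chat-format overhead

prompt_tokens = num_tokens_from_messages(messages)  # helper defined earlier (assumed)
available = CONTEXT_LENGTH - prompt_tokens - SAFETY_MARGIN
max_tokens = min(DESIRED_REPLY, available)

if max_tokens <= 0:
    raise ValueError("The messages alone fill the context window; trim them first.")
```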

you’re right, max_tokens must be the output, you’re a real life saver, thanks!!