Max_tokens limits the total tokens used instead of the output tokens

I’m trying to use the API with the model gpt-3.5-turbo, and I need to receive around 4,000 tokens of output. I understand that the context window is around 16,000 tokens, so there shouldn’t be any problem with prompting with around 3,000 tokens and expecting 4,000 more tokens as output. However, I seem to be limited to around 4,000 TOTAL tokens.

I’ve tried setting the max_tokens parameter in three different ways (a simplified sketch of the call I’m making follows the list):

  1. 4,000 - I keep getting a small response (usually whatever’s left of 4,000 after subtracting the prompt tokens)
  2. 16,000 minus the number of prompt tokens, i.e. roughly the context window size - I get the error [ Server!] Error: 400 max_tokens is too large: 14883. This model supports at most 4096 completion tokens, whereas you provided 14883.
  3. 16,000 - I get an error saying that the total number of tokens is greater than 16,000: [ Server!] Error: 400 This model's maximum context length is 16385 tokens. However, you requested 19847 tokens (3398 in the messages, 449 in the functions, and 16000 in the completion). Please reduce the length of the messages, functions, or completion.
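
For context, here is roughly the shape of the call I’m making. This is a simplified sketch, not my exact code; the prompt content is a placeholder and I’m showing the openai Python client just for illustration:

```python
# Simplified sketch of the call (placeholder prompt, openai Python client).
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "..."},  # ~3,000 tokens of prompt in the real call
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    max_tokens=4000,  # attempt 1; attempts 2 and 3 used larger values and errored
)

print(response.choices[0].message.content)
print(response.usage)  # prompt_tokens / completion_tokens / total_tokens
```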

Thank you in advance for your help

Welcome to the Forum, Nicolas!

A couple of points in response to your issue:

  1. By default, the latest models are limited to 4,096 output tokens, independent of the context window size, so this is the absolute maximum you can get back. The number of output tokens the model can actually return is further constrained by the number of input tokens: in the case of gpt-3.5, if you were to provide 14,000 input tokens against the 16,385-token context window, only 2,385 tokens would remain for output (see the sketch after this list).

  2. In practice, the model rarely returns the full 4,096 output tokens. Besides the number of input tokens, the second factor that influences output length is the prompt itself. There are certain approaches and wordings you can apply to get more detailed responses reaching upwards of 3,000 tokens; it typically requires a bit of trial and error.

  3. The max_tokens parameter has no bearing on how many tokens the model produces in response to a specific prompt. It is simply a ceiling on the length of the model’s response. For example, if you set the value to 200, the response will be cut off at exactly 200 tokens, even if that falls in the middle of a sentence.
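
To make points 1 and 3 concrete, here is a rough sketch (assuming the openai Python client; the token counts are purely illustrative) of capping max_tokens to what can actually be honored and then checking whether the response was cut off:

```python
# Rough sketch: cap max_tokens to what actually fits, then check for truncation.
# Assumes the openai Python client; the token counts are illustrative only.
from openai import OpenAI

CONTEXT_WINDOW = 16_385   # gpt-3.5-turbo context length
MAX_COMPLETION = 4_096    # hard cap on completion tokens for this model

client = OpenAI()

def completion_budget(prompt_tokens: int) -> int:
    """Largest max_tokens value the model can actually honor."""
    return min(MAX_COMPLETION, CONTEXT_WINDOW - prompt_tokens)

prompt_tokens = 14_000                     # e.g. a very long prompt
budget = completion_budget(prompt_tokens)  # -> 2,385, matching point 1

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "..."}],
    max_tokens=budget,
)

# Point 3: if the model hit the max_tokens ceiling, the reply was truncated.
if response.choices[0].finish_reason == "length":
    print("Response was cut off at the max_tokens limit.")
```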

Bearing these three points in mind, perhaps you can share details on what you are trying to achieve, including an example prompt, and we may be able to provide some additional ideas on how to increase your output tokens.


Thank you for your response, really appreciate it.
I’m using function calling, and one of the main properties of the returned object is an array whose number of items needs to be specific. I was trying to use the minItems and maxItems schema properties as well as specifying in the prompt the number of items I wanted. Before reading the topic How to make function calling return array as long as I want?, I thought the reason the API wasn’t returning more items was the token limit, but it turns out specifying minItems isn’t that effective…
After specifying the number of items in the array’s items description, I managed to get longer responses with more tokens and the correct number of items (a simplified version of the schema is below).
Thank you for clarifying how these token limits work. Now I understand that I can indeed receive an output of 4,000 tokens while also providing a prompt of 3,000, for example, so long as the total stays under the roughly 16,000-token context window.
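
For reference, here is a simplified version of the kind of function definition I ended up with (placeholder names, not my real schema; the item count is just an example):

```python
# Simplified sketch of the function definition (placeholder names).
# Spelling out the item count in the array's description worked better for me
# than relying on minItems/maxItems alone.
functions = [
    {
        "name": "return_items",  # hypothetical function name
        "description": "Return a list of items.",
        "parameters": {
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "description": "Exactly 10 items. Do not return fewer or more.",
                    "minItems": 10,  # on their own, the model largely ignored these
                    "maxItems": 10,
                    "items": {"type": "string"},
                },
            },
            "required": ["items"],
        },
    }
]
```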
