Truncated GPT response when max_output_token is low

I noticed that after I lowered max_output_token from 300 to 100, the chance of GPT-4-turbo responding with cut-off text is much higher.

A workaround I can think of is to check whether the response ends with ‘.’, ‘!’, or ‘?’. If it doesn’t, discard it and re-run with a larger max_output_token. But this is an ugly workaround.
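A minimal sketch of that workaround, assuming the punctuation check is applied to the end of the reply. The `generate` callable and the list of limits are hypothetical placeholders for whatever function actually calls the model; nothing here is part of any official client library.

```python
from typing import Callable, Tuple

SENTENCE_ENDINGS = (".", "!", "?")


def looks_complete(text: str) -> bool:
    """Heuristic: the reply ends with sentence-ending punctuation."""
    return text.rstrip().endswith(SENTENCE_ENDINGS)


def generate_with_retry(
    generate: Callable[[int], str],  # hypothetical: takes a token limit, returns reply text
    limits: Tuple[int, ...] = (100, 200, 300),
) -> str:
    """Retry with increasing token limits until the reply looks complete."""
    reply = ""
    for limit in limits:
        reply = generate(limit)
        if looks_complete(reply):
            break
    return reply  # may still be truncated if every limit was too small
```

The heuristic is crude: a reply that legitimately ends with a quotation mark, a code block, or a list item would be treated as truncated, and each retry pays for a full new generation.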

Is there a better solution?

My understanding is that somehow the ‘stop’ token (or whatever is equivalent in GPT) is never generated in this case. Is it a limitation of the transformer decoder architecture that this annoying issue is happening?

What are you trying to accomplish?

Maybe you’re slightly misinformed as to what max_tokens actually does.

Max tokens just cuts your generation off after n tokens, unless the stop token’s been generated before that.

The model itself doesn’t actually see this parameter; it doesn’t inform the model how long you want your text to be. It’s just a dumb counter that shuts it down when it gets too long. You can think of it as a kill switch or failsafe.
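The "dumb counter" idea can be illustrated with a toy decoding loop. This is only a sketch under simplifying assumptions: the `STOP` marker, the `decode` function, and the pre-made token list are all made up for illustration and do not reflect how the real service is implemented.

```python
from typing import Iterable, List, Tuple

STOP = "<stop>"  # hypothetical end-of-sequence marker for this toy example


def decode(model_tokens: Iterable[str], max_tokens: int) -> Tuple[List[str], str]:
    """Collect tokens until the model emits STOP or the counter runs out."""
    output: List[str] = []
    tokens = iter(model_tokens)
    while len(output) < max_tokens:
        token = next(tokens, STOP)
        if token == STOP:
            return output, "stop"  # the model finished on its own
        output.append(token)
    return output, "length"  # the counter cut the generation off


tokens = ["Loss", "aversion", "is", "common", ".", STOP]
print(decode(tokens, 3))   # limit reached first
print(decode(tokens, 10))  # stop marker reached first
```

The token sequence is identical in both calls; only the counter differs, which is why lowering the limit produces more cut-off replies rather than shorter, complete ones.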

If you’re gonna retry if it hasn’t finished, you might as well just leave max_tokens as undefined.
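If a limit is kept anyway, a Chat Completions response reports why generation ended in the `finish_reason` field of each choice: `"length"` means the token limit was hit, `"stop"` means the model ended naturally or hit a stop sequence. A minimal sketch, assuming `response_json` is the parsed JSON body of such a response (the helper name `was_truncated` is made up here):

```python
def was_truncated(response_json: dict) -> bool:
    """True when the first choice ended because it hit the token limit."""
    choice = response_json["choices"][0]
    return choice.get("finish_reason") == "length"


# Trimmed-down example bodies containing only the fields the check needs.
cut_off = {"choices": [{"message": {"content": "2. Loss aversion:"}, "finish_reason": "length"}]}
finished = {"choices": [{"message": {"content": "All done."}, "finish_reason": "stop"}]}

print(was_truncated(cut_off))
print(was_truncated(finished))
```

This avoids guessing from punctuation, since the check uses what the API itself reports.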

Does that help?

To get complete responses within the max_tokens limit from a Large Language Model (LLM) like GPT-4, without manually handling truncated responses, you can try the following:

  1. Prompt Engineering: You can append instructions like “In two sentences, explain…” or “Summarise in 100 words…” to your prompt.

  2. Adjust Sampling Parameters:

  • Temperature: Lowering the temperature can make the model’s responses more deterministic and concise, potentially fitting more effectively within the max_tokens limit.
  • Top_p (Nucleus Sampling): Adjusting top_p can also influence the conciseness and relevance of responses, making it easier to fit within a specified token count.

  3. Use the stop Parameter:
    The stop parameter lets you define specific sequences (e.g., punctuation marks or phrases) at which the model should stop generating further text. By setting this parameter to common sentence-ending punctuation (“.”, “?”, “!”), generation ends at the first such mark, so the response stops at a sentence boundary instead of being cut off mid-sentence by the max_tokens limit.
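To see what a stop list of sentence-ending punctuation does to a reply, here is a small client-side emulation. It is an illustration only: `apply_stop` is a made-up helper, and the real cut happens on the server during generation. It reflects the documented behaviour that output ends at the first stop sequence and that the stop sequence itself is not returned.

```python
from typing import List


def apply_stop(text: str, stop: List[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence, excluding it."""
    cut = len(text)
    for seq in stop:
        index = text.find(seq)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


full = "Emotional bias is a tendency. It can influence decisions! Why?"
print(apply_stop(full, [".", "?", "!"]))  # prints: Emotional bias is a tendency
```

Only the first sentence survives, and it comes back without its final period.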

Response without the suggestions above:

"content": "Emotional bias refers to the tendency for individuals to make decisions or interpret information based on their emotions rather than objective reasoning. This bias can influence perceptions, judgements, and behaviors in various aspects of life, including personal relationships, work, and decision-making.\n\nEmotional bias can manifest in different ways, such as:\n\n1. Confirmation bias: This is the tendency to seek out information that confirms preexisting beliefs or emotions, while ignoring or discounting information that contradicts them.\n\n2. Loss aversion:"

As you can see, the response got truncated.

Suggested Python code:

import json

import requests

# The endpoint URL is blank in the original post; set it to your chat completions endpoint.
url = ''

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YourKEY',  # replace YourKEY with your API key
}

data = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            # One option is to ask the model to stay under the limit:
            # "content": "Explain the complete concept of Emotional Bias under 370 characters."
            "content": "Explain the complete concept of Emotional Bias."
        }
    ],
    "max_tokens": 100,
    "stop": [".", "?", "!"],  # stop at the first sentence-ending punctuation mark
    "temperature": 0.4  # low temperature
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    print(json.dumps(response.json(), indent=2))
else:
    print("Error:", response.text)

Response with this code

"content": "Emotional bias refers to the tendency of individuals to make decisions based on their emotions rather than on objective evidence or rational thinking"

Thank you for all your suggestions. I will try them!

The issue I see with setting the stop parameter to (“.”, “?”, “!”) is that it will stop the generation at the first instance of sentence-ending punctuation. The model would use exactly one sentence for each response, which is not what I want.
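One alternative that keeps multi-sentence replies is to leave stop unset and trim a cut-off reply back to its last complete sentence on the client side. This is a rough sketch, not something suggested earlier in the thread: `trim_to_last_sentence` is a made-up helper, and the regular expression is a heuristic that skips periods directly after a digit so that list markers like "2." are not mistaken for sentence ends.

```python
import re

# A sentence end: '.', '!' or '?' not preceded by a digit, followed by whitespace or end of text.
SENTENCE_END = re.compile(r"(?<!\d)[.!?](?=\s|$)")


def trim_to_last_sentence(text: str) -> str:
    """Drop any trailing fragment after the last complete sentence."""
    matches = list(SENTENCE_END.finditer(text))
    if not matches:
        return text  # no sentence boundary found; leave the text untouched
    return text[: matches[-1].end()].rstrip()


reply = (
    "Bias affects decisions. It shows up in:\n\n"
    "1. Confirmation bias: seeking agreeable facts.\n\n"
    "2. Loss aversion:"
)
print(trim_to_last_sentence(reply))
```

The trade-off is that the dangling "2. Loss aversion:" item is silently dropped, and a sentence that really ends in a number (for example "in 2024.") would not be recognised as a boundary.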
