Ultra Aggressive Truncation Behavior

I’m seeing some weird behavior with `truncation: 'auto'` in the Responses API.

It seems that any time I use this param, the API truncates any input over ~75k tokens down to around 72k tokens. The docs say truncation should only happen when the input exceeds the model's context limit, but it kicks in around ~75k tokens.

`auto`: If the input to this Response exceeds the model's context window size, the model will truncate the response to fit the context window by dropping items from the beginning of the conversation.
https://developers.openai.com/api/reference/resources/responses

This is nowhere near the limit of these models (I’ve tried multiple models). I understand it has to leave room for other things, but this seems like either the docs are wrong or I’m doing something wrong.

I’ve been searching for an answer to this for a while, but I haven’t seen anyone else deal with it. And the truncation documentation is pretty bare, so any info anyone might have would be super helpful.

The example below should not truncate to 72k, but it does every time.

const getData = async (): Promise<void> => {
  const model = 'gpt-5.2';
  const client = getOpenAI();
  const input = "...imagine this is a 100k token string...";

  const response = await client.responses.create({
    model,
    input,
    truncation:        'auto',
    max_output_tokens: 10_000,
    reasoning:         {effort: 'none'},
  });

  console.log(`USAGE:`, response.usage);
};

/*
USAGE OUTPUT: {
  input_tokens: 72019,
  input_tokens_details: { cached_tokens: 0 },
  output_tokens: 212,
  output_tokens_details: { reasoning_tokens: 0 },
  total_tokens: 72231
}
*/

You seem to have max_output_tokens set to a low value (10,000). Have you tried increasing it? It caps the amount of output text plus internal reasoning the model can generate.

1 Like

Yeah, tried everything from 5 to 50k, same behavior.

Having said that, increasing max_output_tokens doesn’t seem like it would help my issue. If anything, lowering it seems like it would help more (since that would free up capacity).

That’s not how it works. Lowering max_output_tokens will not free context for the model. It just sets a cap below the model’s output limit, and reasoning tokens count towards that output limit too. It is a parameter for limiting your budget.

For gpt-5.2, the model output limit is 128k, and if you consider that reasoning tokens also count towards it, the net output is usually lower. Therefore, 70k is not impossible depending on the complexity of the task.

By setting a limit of 10k, you’d get much less output. Looking at your usage stats, the total tokens you see include input tokens, which don’t count against the output token limit, only against the total 400k window (input + output). You are being stopped from generating the whole answer because you limited the output.

Again, you can still try increasing max_output_tokens to something like 80k, or removing the parameter entirely.

Thanks, but I’m not talking about the output being truncated. The output is fine. I’m talking about the input being truncated by the API.

2 Likes

Sorry my bad, I misunderstood your question.

I was able to reproduce and confirm the issue here using the responses.input_tokens.count endpoint, and it really does seem to be applying an input truncation as you reported.

I'll leave it here for reference.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def make_large_test_string(target_chars: int = 700_000) -> str:
    """
    Build a long synthetic string.
    Character count != token count, so we verify with input_tokens.count.
    """
    base = (
        "This is a synthetic test payload for OpenAI Responses API token counting. "
        "It is intentionally repetitive but includes varying section identifiers. "
        "The goal is to measure whether truncation='auto' changes the counted input size. "
    )

    parts = []
    total_chars = 0
    i = 0

    while total_chars < target_chars:
        chunk = f"[section={i}] {base}\n"
        parts.append(chunk)
        total_chars += len(chunk)
        i += 1

    return "".join(parts)


def count_tokens(model: str, text: str, truncation: str):
    result = client.responses.input_tokens.count(
        model=model,
        input=text,
        truncation=truncation,
    )
    return result

model = "gpt-5.2"
big_input = make_large_test_string(target_chars=700_000)

print(f"Character length: {len(big_input):,}")

disabled_result = count_tokens(model, big_input, "disabled")
auto_result = count_tokens(model, big_input, "auto")

print("\n=== input_tokens.count results ===")
print(f"disabled: {disabled_result.input_tokens:,}")
print(f"auto:     {auto_result.input_tokens:,}")

if auto_result.input_tokens < disabled_result.input_tokens:
    print("\nAUTO appears to be counting a truncated input.")
elif auto_result.input_tokens == disabled_result.input_tokens:
    print("\nAUTO and DISABLED counted the same number of input tokens.")
else:
    print("\nAUTO counted more than DISABLED, which would be unexpected.")

"""
OUTPUT:
Character length: 700,146

=== input_tokens.count results ===
disabled: 136,958
auto:     72,002

AUTO appears to be counting a truncated input.
"""
1 Like

GPT-5 models, while they advertise a 400k context length, have a fixed, non-overlapping split between input and output, unlike other models:

input context: maximum 272,000 tokens
output generation: maximum 128,000 tokens (per generation/iteration)

That should mean that max_output_tokens is completely irrelevant. It should not consume or limit any amount of input based on a context window budget, and in fact, on Responses, you can set it to millions, because it acts as a multi-turn budget (when you use internal tools and the context is run as input over and over for more reasoning and generation).

(On Chat Completions, with no internal tools to budget for, max_tokens does act a bit like a “reservation”, deducting from the input you can send versus a 128k model’s full context window.)

Thus, max_output_tokens is not a factor to consider here.
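
To put that split in code form, here is a tiny sketch; the 272k/128k figures are just the numbers listed above (an assumption based on this post, not published SDK constants):

# Assumed caps from the input/output split described above.
GPT5_MAX_INPUT_TOKENS = 272_000
GPT5_MAX_OUTPUT_TOKENS = 128_000  # per generation/iteration

def should_auto_truncate(measured_input_tokens: int) -> bool:
    # Per the quoted docs, truncation='auto' should only act when the input
    # exceeds the model's input cap, independently of whatever
    # max_output_tokens is set to.
    return measured_input_tokens > GPT5_MAX_INPUT_TOKENS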


Shower thought: 272 characters is a lot like 73 tokens…if someone was a bad coder.

You must guarantee to yourself that you are actually sending the count of o200k_base tokens you think you are sending. Measure the text with Python + tiktoken. Use the /responses/input_tokens measurement endpoint, which has its own truncation parameter that you need not set.

If your token count is truly large but the reported usage is smaller, the fault is the API messing with your messages. You should be able to send a 1M-character single role message before the API refuses arbitrarily - which can still be under 250k tokens. The “auto” truncation should only take action when the input exceeds a model’s maximum input, and that truncation should be turn-based dropping, not an inspection that damages your input messages themselves and destroys understanding.

Additionally, even if you were using a conversation ID or previous response ID, it should be those old messages that get dropped - 100k tokens being easy to place.
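
As a concrete cross-check, here is a sketch (gpt-5.2 is just the model from this thread) that compares a local o200k_base count against the same responses.input_tokens.count endpoint used in the earlier reproduction, with truncation disabled:

import os

import tiktoken
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
enc = tiktoken.get_encoding("o200k_base")

text = "Some very long prompt segment... " * 20_000

# What you believe you are sending:
local_count = len(enc.encode(text))

# What the API counts before any truncation:
api_count = client.responses.input_tokens.count(
    model="gpt-5.2",
    input=text,
    truncation="disabled",
).input_tokens

print(f"tiktoken (o200k_base): {local_count:,}")
print(f"input_tokens.count:    {api_count:,}")
# These should roughly agree (the API adds a handful of message-framing tokens).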

First, in this code, `const input` is a dangerous variable name to use. Name your huge string something else, like `megaPrompt`.

The next approach I would take is to send a fully-qualified array of typed role messages instead of a string as input.

{
  "model": "gpt-5.2",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        {"type": "input_text", "text": megaPrompt}
      ]
    }
  ]
}

For further investigation, you can break up contents such as file references and the user-typed prompt into multiple user messages, placing them in the order they should expire when approaching the max. You can also retrieve the input from a stored response ID and observe the nature of what was actually placed.
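
A sketch of that last step, assuming your SDK version exposes the list-input-items endpoint as client.responses.input_items.list (that is how it appears in recent Python SDKs; adjust if yours differs):

import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Store the response, then ask the API what it recorded as the input.
resp = client.responses.create(
    model="gpt-5.2",
    store=True,
    truncation="auto",
    input=[
        {"role": "user", "content": [{"type": "input_text", "text": "first part of the prompt"}]},
        {"role": "user", "content": [{"type": "input_text", "text": "second part of the prompt"}]},
    ],
)

items = client.responses.input_items.list(resp.id)
for item in items.data:
    print(item.type, getattr(item, "role", None))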

Then: what is your tier and TPM rate limit? gpt-5.2 should be 500,000 TPM even at Tier 1. It would be improbably dumb if the placement and truncation were based on a remaining rate-limit amount. Also dumb if based on a “we don’t have enough compute, so we’ll damage API calls” or a “we decided on a different unpublished max cost than the ridiculous one”.

Asking 272k-token questions of gpt-5.2 is a $0.48 proposition - so use gpt-5.nano instead (200k TPM limit on Tier 1).

The final solution is to perform your own budgeting and turn-dropping; don’t use the truncation parameter.

(…there is the small chance that the input billing is messed up - and not the model placement).
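
A minimal sketch of that do-it-yourself budgeting (the 250k budget is an arbitrary value under the 272k input cap; the items use the same shape as the Responses input array):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def item_tokens(item: dict) -> int:
    # Count only the input_text parts of one input item.
    return sum(len(enc.encode(p["text"])) for p in item["content"] if p["type"] == "input_text")

def drop_oldest_to_fit(items: list[dict], budget_tokens: int = 250_000) -> list[dict]:
    # Drop whole turns from the front until the remaining input fits the budget,
    # instead of letting truncation='auto' decide.
    kept = list(items)
    while kept and sum(item_tokens(i) for i in kept) > budget_tokens:
        kept.pop(0)
    return kept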

1 Like

Thanks, I’ve tried a lot of these things (with the same results): messages vs. a plain input string, different styles and sizes, and I use tiktoken for token counting… nothing seems to work.

As for the code, I was just trying to make the example above as easy to understand as possible.

I just can’t figure this out. It seems like I’ve got to be doing something wrong; there’s no way this could be this broken.

1 Like

I think I’ve got a clue about what is going on: there seems to be a limit on the maximum size of individual input elements for the gpt-5.x family when using truncation='auto'.

Since we usually don’t send a single prompt that huge, but rather a collection of individual turn elements in a big conversation, this issue hasn’t been noticed very often.

That limit seems to be around 70k tokens, but if you break the input down into smaller elements, it goes through as expected.

Here is a batch of tests breaking down a 700k payload into input items with different lengths:

Model: gpt-5.2
TOTAL_CHARS: 700,000
STEP_ITEM_CHARS: ['700,000', '400,000', '350,000', '300,000']

step | item_chars | items | payload_chars | single_item_tokens | disabled_tokens | auto_tokens | diff   | status         
-----+------------+-------+---------------+--------------------+-----------------+-------------+--------+----------------
1    | 700,000    | 1     | 700,000       | 136,406            | 136,406         | 72,002      | 64,404 | auto < disabled
2    | 400,000    | 2     | 700,000       | 77,951             | 136,414         | 130,465     | 5,949  | auto < disabled
3    | 350,000    | 2     | 700,000       | 68,208             | 136,414         | 136,414     | 0      | same           
4    | 300,000    | 3     | 700,000       | 58,465             | 136,420         | 136,420     | 0      | same           

Notice that at 350k chars per item or less (about 70k tokens) it does not truncate and works as expected, and you can still send a 700k payload in 2 items.

So, a workaround is probably to break the prompt down into 2 or more elements like this whenever a single one exceeds ~350k chars:

{
  "model": "gpt-5.2",
  "input": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "This is the first user turn."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "This is the second user turn."
        }
      ]
    }
  ],
  "truncation": "auto"
}

Or perhaps use a file input if you are sending a book or similar, but I haven’t tested that.
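
A small helper along those lines; the 350k-chars-per-item figure is just the empirical threshold from the table above, not a documented limit:

def split_into_items(text: str, max_chars: int = 350_000) -> list[dict]:
    # Split one huge prompt into several user input items so that no single
    # item crosses the size where truncation='auto' starts dropping input.
    return [
        {
            "role": "user",
            "content": [{"type": "input_text", "text": text[i : i + max_chars]}],
        }
        for i in range(0, len(text), max_chars)
    ]

# Usage sketch (megaPrompt is whatever huge string you were sending before):
# response = client.responses.create(
#     model="gpt-5.2",
#     input=split_into_items(megaPrompt),
#     truncation="auto",
# )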

Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

MODEL = "gpt-5.2"

# Fixed total payload size per step, in characters.
TOTAL_CHARS = 700_000

# Characters per individual input item for each step.
STEP_ITEM_CHARS = [700_000, 400_000, 350_000, 300_000]


def make_text(item_index: int, target_chars: int) -> str:
    prefix = f"[item={item_index}] "
    base_sentence = (
        "This is a synthetic test payload for the OpenAI Responses API. "
        "We are sending many separate input items instead of one large string. "
        "The purpose is to test whether truncation='auto' changes the counted "
        "input token total or appears to exclude earlier items from the input. "
    )

    text = prefix
    while len(text) < target_chars:
        text += base_sentence

    return text[:target_chars]


def make_single_input_item(item_chars: int) -> list[dict]:
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": make_text(0, item_chars),
                }
            ],
        }
    ]


def make_input_items(total_chars: int, item_chars: int) -> list[dict]:
    """
    Fill a total character budget using as many input items as fit.
    The last item may be smaller to exactly fill the budget.
    """
    items = []
    remaining = total_chars
    i = 0

    while remaining > 0:
        current_size = min(item_chars, remaining)
        text = make_text(i, current_size)

        items.append(
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": text,
                    }
                ],
            }
        )

        remaining -= len(text)
        i += 1

    return items


def approx_char_count(input_items: list[dict]) -> int:
    return sum(
        len(part["text"])
        for item in input_items
        for part in item["content"]
        if part["type"] == "input_text"
    )


def count_tokens(model: str, input_items: list[dict], truncation: str):
    return client.responses.input_tokens.count(
        model=model,
        input=input_items,
        truncation=truncation,
    )


def safe_count(model: str, input_items: list[dict], truncation: str):
    try:
        result = count_tokens(model, input_items, truncation)
        return result.input_tokens, None
    except Exception as e:
        return None, str(e)


def fmt_num(value):
    return "-" if value is None else f"{value:,}"


def fmt_text(value, max_len=38):
    if value is None:
        return "-"
    return value if len(value) <= max_len else value[: max_len - 3] + "..."


def print_table(rows):
    headers = [
        "step",
        "item_chars",
        "items",
        "payload_chars",
        "single_item_tokens",
        "disabled_tokens",
        "auto_tokens",
        "diff",
        "status",
    ]

    data = []
    for row in rows:
        data.append(
            [
                str(row["step"]),
                f'{row["item_chars"]:,}',
                f'{row["items"]:,}',
                f'{row["payload_chars"]:,}',
                fmt_num(row["single_item_tokens"]),
                fmt_num(row["disabled_tokens"]),
                fmt_num(row["auto_tokens"]),
                fmt_num(row["diff"]),
                row["status"],
            ]
        )

    widths = [len(h) for h in headers]
    for r in data:
        for i, cell in enumerate(r):
            widths[i] = max(widths[i], len(cell))

    def render_row(row):
        return " | ".join(cell.ljust(widths[i]) for i, cell in enumerate(row))

    separator = "-+-".join("-" * w for w in widths)

    print(render_row(headers))
    print(separator)
    for r in data:
        print(render_row(r))


def run_step(step_index: int, total_chars: int, item_chars: int):
    single_item = make_single_input_item(item_chars)
    payload_items = make_input_items(total_chars=total_chars, item_chars=item_chars)

    single_item_tokens, single_item_err = safe_count(MODEL, single_item, "disabled")
    disabled_tokens, disabled_err = safe_count(MODEL, payload_items, "disabled")
    auto_tokens, auto_err = safe_count(MODEL, payload_items, "auto")

    diff = None
    status = "ok"

    if disabled_tokens is not None and auto_tokens is not None:
        diff = disabled_tokens - auto_tokens
        if diff > 0:
            status = "auto < disabled"
        elif diff == 0:
            status = "same"
        else:
            status = "auto > disabled"
    elif disabled_err and auto_tokens is not None:
        status = "disabled failed"
    elif auto_err and disabled_tokens is not None:
        status = "auto failed"
    elif disabled_err and auto_err:
        status = "both failed"

    return {
        "step": step_index + 1,
        "item_chars": item_chars,
        "items": len(payload_items),
        "payload_chars": approx_char_count(payload_items),
        "single_item_tokens": single_item_tokens,
        "disabled_tokens": disabled_tokens,
        "auto_tokens": auto_tokens,
        "diff": diff,
        "status": status,
        "single_item_error": single_item_err,
        "disabled_error": disabled_err,
        "auto_error": auto_err,
    }


if __name__ == "__main__":
    print(f"Model: {MODEL}")
    print(f"TOTAL_CHARS: {TOTAL_CHARS:,}")
    print(f"STEP_ITEM_CHARS: {[f'{x:,}' for x in STEP_ITEM_CHARS]}")
    print()

    rows = []
    for idx, item_chars in enumerate(STEP_ITEM_CHARS):
        rows.append(run_step(idx, TOTAL_CHARS, item_chars))

    print_table(rows)

    print("\nErrors:")
    for row in rows:
        if row["single_item_error"]:
            print(f"Step {row['step']} single-item count failed: {row['single_item_error']}")
        if row["disabled_error"]:
            print(f"Step {row['step']} disabled failed: {row['disabled_error']}")
        if row["auto_error"]:
            print(f"Step {row['step']} auto failed: {row['auto_error']}")
1 Like