How do the models handle input sizes over the context window length?

How do the models handle going over the context window length with the Responses API? Will it return an API error, or just truncate the input?

Is there a place in the docs that specifies this behavior that I missed?

Thanks in advance :slight_smile:

Concern: max generation length

It seems you are not worried about sending more than the maximum input, but rather about the effect of the AI model generating beyond the remaining available context length.

First, this is not an immediate concern with the gpt-5 series. Of the 400k context window, you get a maximum of 272k to use as input, so there is always a 128k remainder, far beyond the length the AI is willing to write.

Second, on other models there is no “reservation” of output by the limiter: max_output_tokens is just a budget, and you can send 1000000.
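As a sketch of that “budget, not reservation” point (the model name and prompt here are placeholders, and the dict stands in for the kwargs you would pass to `client.responses.create(**request)`):

```python
# Sketch: max_output_tokens is a spending cap, not a reservation.
# An oversized budget is accepted; generation simply stops early if the
# remaining context runs out or the model finishes sooner.
request = {
    "model": "gpt-4o",                       # placeholder model
    "input": "Write a poem about a kitten.",  # placeholder prompt
    "max_output_tokens": 1_000_000,          # far beyond any context window
}

print(request["max_output_tokens"])
```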

Effect of a cut-off output

I can report the effect of an output truncated by max_output_tokens limiting it.

gpt-5x

Responses: receive the partial output if the AI had transitioned to producing output
Chat Completions: receive NOTHING, even what should have been seen, all billed as reasoning.

o4-mini, o3

Responses: receive the partial output if the AI had transitioned to producing output
Chat Completions: receive NOTHING, an error without your usage: ‘Could not finish the message because max_tokens or model output limit was reached’

gpt-4o

Responses: receive the partial output, the exact max_output_tokens billed
Chat Completions: receive a bill for 100 output tokens, but only get 20 tokens of the output.

Conclusion: OpenAI has degraded Chat Completions further, with multiple issues robbing you of the partial content you paid for.
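A hedged sketch of how you might detect this cutoff on each endpoint, using stand-in dicts shaped like the real API responses (no network call is made; field names follow the current API reference: `finish_reason == "length"` on Chat Completions, `status == "incomplete"` with `incomplete_details.reason == "max_output_tokens"` on Responses):

```python
# Sketch: detecting a max-token cutoff on each endpoint.

def chat_was_cut_off(chat_response: dict) -> bool:
    # Chat Completions signals a budget cutoff via finish_reason == "length"
    return chat_response["choices"][0]["finish_reason"] == "length"

def responses_was_cut_off(response: dict) -> bool:
    # Responses signals it via status "incomplete" plus a reason field
    return (
        response.get("status") == "incomplete"
        and (response.get("incomplete_details") or {}).get("reason")
        == "max_output_tokens"
    )

# Stand-in payloads illustrating the two shapes:
chat = {"choices": [{"finish_reason": "length", "message": {"content": "partial"}}]}
resp = {"status": "incomplete", "incomplete_details": {"reason": "max_output_tokens"}}

print(chat_was_cut_off(chat), responses_was_cut_off(resp))  # True True
```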


If you are using an internal tool iterator on Responses with server-side chat state, where you don’t control how many times the AI will call itself over and over, then you can either get a 500 server error from too much input, or an error from running the total remaining context up to the maximum (while the AI is not responding with seen output).

You would have to use “truncation”: “auto”, which automatically discards the start of the conversation (or of passed messages as input) to let the AI continue in this situation.
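On the request side that is one extra parameter (a sketch; model and input are placeholders, and the dict stands in for the kwargs you would pass to `client.responses.create(**request)`):

```python
# Sketch: enabling automatic context-window truncation on a Responses call.
request = {
    "model": "gpt-4o",                         # placeholder
    "input": "long running conversation...",   # placeholder
    # Drop turns from the start of the context instead of erroring out
    # when the conversation no longer fits the window.
    "truncation": "auto",
}

print(request["truncation"])
```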


Evidence: the exact same maximum-output run on non-reasoning gpt-4o receives almost nothing on Chat Completions:

— Testing gpt-4o-2024-08-06 (Chat Completions)

In a quiet corner of a sunlit room,
A kitten stirred from a nap’s sweet b

input tokens: 24 output tokens: 100
uncached: 24 non-reasoning: 100

— Testing gpt-4o-2024-08-06 (Responses)

In the heart of a sunlit morn,
A kitten woke with a playful yawn.
Her eyes, like marbles, wide and bright,
Gazed at the world with pure delight.

She tiptoed past the garden gate,
Where flowers danced in colors great.
The breeze whispered secrets in her ear,
Of lands unknown, both far and near.

With a leap, she crossed the dewy grass,
Chasing shadows that seemed to pass.

input tokens: 24 output tokens: 100
uncached: 24 non-reasoning: 100

Is a crummy safety system holding back a long run of tokens, denying your Chat Completions output? Or is OpenAI just trying to foist Responses on you, unwanted?


What happens when the input, in your first example, exceeds 272k tokens? Will it refuse to execute the model?

The edge limiter will give an API call model error:

Error: Input tokens exceed the configured limit of 272,000 tokens

With a 1M-token-input model like gpt-4.1, if you are on a lower payment tier such as tier 2 (450,000 TPM for that model), you’d get a similar error from the rate limiter before anything could run.
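A hedged sketch of a pre-flight check against both limiters, using the figures quoted above (in real use `input_tokens` would come from a tokenizer such as tiktoken; that the TPM estimate counts input plus the output budget is an assumption here, and limits vary per model and tier):

```python
# Sketch: pre-flight check against the model input limit and a TPM rate limit.

MODEL_INPUT_LIMIT = 272_000   # gpt-5 series max input, per the post above
TIER_TPM_LIMIT = 450_000      # example tier-2 TPM for gpt-4.1, per the post above

def check_request(input_tokens: int, max_output_tokens: int) -> str:
    """Return which limiter would reject the call first, or 'ok'."""
    if input_tokens > MODEL_INPUT_LIMIT:
        return "model input limit"      # edge limiter: hard API error
    if input_tokens + max_output_tokens > TIER_TPM_LIMIT:
        return "rate limiter (TPM)"     # rejected before the model runs
    return "ok"

print(check_request(300_000, 1_000))    # over the 272k input cap
print(check_request(200_000, 300_000))  # within input cap, but over TPM
print(check_request(10_000, 4_000))     # fine
```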
