The docs say the context window for GPT-5 is 400k tokens, but in production we’re hitting this error:
Input tokens exceed the configured limit of 272,000 tokens. Your messages resulted in 297,006 tokens.
This is a huge gap in the documentation. Every other model we’ve used has had its maximum input limit equal to its advertised context window. But with GPT-5, the actual maximum input is 272k tokens; the remaining 128k of the 400k window is reserved for the maximum possible output. Nowhere in the official GPT-5 model card is this called out.
The only place this limit seems to be mentioned is in the GPT-5 blog post, and it’s not searchable from the API docs or API reference. If you only follow the model card, you’ll assume you can send the full 400k tokens as input and you’ll get hard errors in production.
Other models do not behave like this. For example, if I send the same request to o4-mini, the error matches its advertised 200k token limit:
This model’s maximum context length is 200,000 tokens. However, your messages resulted in 297,006 tokens.
The difference in error behavior and lack of clear documentation for GPT-5’s split input/output limits is a major pitfall for developers. This needs to be clearly stated in the API model cards so teams don’t waste time chasing what looks like a bug.
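Until the model cards spell this out, a pre-flight guard is the safest workaround: fail locally before the request ever leaves your service. A minimal sketch, assuming the 272,000 input limit and 128,000 output reservation inferred from the error message and blog post above (these constants are not officially documented and may change):

```python
# Assumed GPT-5 limits, inferred from the API error message and the blog
# post -- NOT documented constants; verify before relying on them.
GPT5_TOTAL_CONTEXT = 400_000
GPT5_MAX_OUTPUT = 128_000
GPT5_MAX_INPUT = GPT5_TOTAL_CONTEXT - GPT5_MAX_OUTPUT  # 272,000

def check_input_budget(input_tokens: int, max_input: int = GPT5_MAX_INPUT) -> None:
    """Raise before calling the API instead of getting a hard error back."""
    if input_tokens > max_input:
        raise ValueError(
            f"Input of {input_tokens:,} tokens exceeds the effective GPT-5 "
            f"input limit of {max_input:,} tokens (total context "
            f"{GPT5_TOTAL_CONTEXT:,} minus {GPT5_MAX_OUTPUT:,} reserved for output)."
        )
```

Counting tokens client-side still requires a tokenizer (e.g. tiktoken); the sketch assumes you already have a count, such as the 297,006 from the failing request above, which would trip the guard locally.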
I think they’re saying 400k is the total context and it can output up to 128k, so guess what 400 minus 128 is? The same number as mentioned in that blog post you linked.
Having said that, I also thought 400k was the input size, but I guess they wrote it that way because they used this prompt to come up with the number:
SA: Other LLMs from Google and such have 1M token window, so how can we make ours sound similar to those, even though they are not?
GPT5: It sounds like you want the model to seem closer to the competitors. Instead of labeling the input token and output token limits, you could label the total token limit along with the output token limit. Technically, you would not be lying. Let me know if you would like me to create a few comparison charts for you that you could use during the announcement presentation.
If you send “max_output_tokens” to a typical model, it isn’t just a cutoff for generation: the rate limiter also treats it as a reservation of output space, reducing the amount of input you can send and counting against your tier limits when deciding whether to block a request.
GPT-5 doesn’t seem to expose that parameter to the developer in the same way. I haven’t probed it myself to give you a definitive answer (which is cheap to do if you start high and walk down on errors), but the expectation would be: if the model is capped at, say, a maximum of 4,000 output tokens via max_output_tokens, the remainder of the context window should be available for input.
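One cheap way to pin down the real input ceiling is the probing described above, done as a binary search over input sizes, treating each API call as an accept/reject oracle. A sketch with the API call stubbed out (`accepts` is a hypothetical hook you would replace with a real request that catches the token-limit error):

```python
def find_max_input(accepts, lo: int = 0, hi: int = 400_000) -> int:
    """Binary-search the largest input size the endpoint accepts.

    `accepts(n)` should send an n-token request and return True if it
    was not rejected for exceeding the input limit.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if accepts(mid):
            lo = mid       # mid tokens went through; the limit is >= mid
        else:
            hi = mid - 1   # mid tokens were rejected; the limit is < mid
    return lo

# Stubbed oracle simulating a hidden 272,000-token input limit:
assert find_max_input(lambda n: n <= 272_000) == 272_000
```

Against a live endpoint this costs roughly log2(400,000) ≈ 19 requests, and only the accepted ones are billed for meaningful token counts.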
(unless, that is, OpenAI also runs a hidden round of inspection, judgement, or policy checks with the same model, consuming unreported reasoning tokens that are neither limited nor exposed)
Even today it is not clear what the actual maximum input token count is: I was able to successfully process a 292k-token input with gpt-5.2 but can’t seem to go above that regardless of how I set max_output_tokens. This lack of clarity in the documentation is frustrating.