Pricing by Context: Does “Context” (Short/Long Context) Include Completion Tokens?

There are two pricing tiers based on context size for the latest models.

Does the “context” here include:

  • prompt/input tokens only, or

  • prompt/input tokens + completion/output tokens?

A: The input you send, measured in tokens per API call, is what triggers the increased price.

That’s a great question. A model’s total context is a single memory space in which the AI keeps generating its output token-by-token, so an output could conceivably cross a size threshold partway through generation. We need a closer look to determine how billing works and exactly when the price jumps. Let’s get into some background information to facilitate our detective work.

First, let’s consider the context window itself. This was traditionally a shared space for both input and output, because input and output were not treated differently: the AI just generated another completion token after whatever was currently in the context window, up to the most recent token. Then another, and another, with the growing contents informing the best token to continue with.

To illustrate, take a completion language model such as GPT-3’s text-davinci-003, which had a context window length of 4k. If you prompted, “here’s a collection of 100 interesting poems”, merely 10 tokens of text preloaded by your API call, the entire remainder of the 4k context window would be available for generating a response. Likewise, if you used 3.99k just loading context text, the output would be truncated if the AI needed to continue past the few tokens remaining. And if your input alone exceeded the model’s context length, or your input plus the space you reserved for output with the max_tokens parameter did, you got an error:

message: "This model's maximum context length is 4097 tokens, however you requested 5360 tokens (1360 in your prompt; 4000 for the completion). Please reduce your prompt; or completion length.",
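As a rough sketch of that older shared-window rule, the check amounts to adding the prompt size to the reserved completion size and comparing against the window. The function name and its return convention are made up for illustration; only the 4097 limit and the numbers in the error above come from the message itself.

```python
from typing import Optional

def check_shared_window(prompt_tokens: int, max_tokens: int,
                        context_length: int = 4097) -> Optional[str]:
    """Illustrative only: mimic the shared input/output budget check.

    Returns an error string when prompt + reserved completion space
    exceeds the window, otherwise None. Not an actual API function.
    """
    requested = prompt_tokens + max_tokens
    if requested > context_length:
        return (
            f"This model's maximum context length is {context_length} tokens, "
            f"however you requested {requested} tokens "
            f"({prompt_tokens} in your prompt; {max_tokens} for the completion)."
        )
    return None

# The case from the error message above: 1360 + 4000 = 5360 > 4097
print(check_shared_window(1360, 4000))
# A 10-token prompt leaves nearly the whole window for output
print(check_shared_window(10, 4000))
```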

That behavior changed with gpt-4-turbo, which has a 128k token context length. OpenAI capped the maximum generation output at 4k. Yes, the model could only produce about 1000-3000 words, depending on the world language. There was a maximum you could get as output; however, you could still “infringe” on that final 4k of space by sending enough input that you’d only have 2k or 1k left for a response. Thus, apart from your output being cut off at that maximum, you still had a 128k combined input/output space.

That behavior changed again with gpt-5, which has a 400k context length. The maximum output you could receive at that point was 128k. Perhaps because of frequently asked questions and confusion about the behaviors just described, this possible 128k of output space is now always reserved exclusively for output. If you send input that would infringe on the output space this specification requires, you’ll get an API error.

Now, how much of that context space is left for input, as a length that can be treated separately and without overlap? 400k - 128k = 272k. Pay attention to why that number is important.
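The same subtraction, written out as a small sketch. The helper name and the `output_reserved` flag are hypothetical; the 400k and 128k figures are the ones quoted above.

```python
def max_input_tokens(context_window: int, max_output: int,
                     output_reserved: bool) -> int:
    """Illustrative only: largest input that fits under each regime.

    output_reserved=True  -> output space is set aside and input may not
                             overlap it (the gpt-5 behavior described above).
    output_reserved=False -> input and output share one window, so input
                             can eat into the space output would have used.
    """
    if output_reserved:
        return context_window - max_output
    return context_window

# gpt-5 as described above: 400k window, 128k reserved for output
print(max_input_tokens(400_000, 128_000, output_reserved=True))  # 272000
```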

Now look at the OpenAI price sheet, and hover over the help indicator next to “long context”.

Hey, look, there’s 272k listed, the same remaining input size I just described for the 400k context window model. Peculiar? It would seem that on the original GPT-5 there is no way to be in anything but “short context”, because any input over 272k tokens would be rejected.

What changed is that GPT-5.4 and GPT-5.5 have a 1M context mode. You can send much more. The cost of running an autoregressive AI model increases progressively with longer input (then is somewhat capped by self-attention limits as a second layer). Rather than pricing progressively, however, OpenAI decided to make a single pricing “switch” for use of this advanced ability.

Send more than 272k tokens (of the kind your question is trying to pin down), and you pay a higher price.

The threshold lines up exactly with the old 272k cap on what you could send, and the higher tier arrives with the new models that accept longer input. That is our immediate clue: send more input than was previously possible, and your price per token on GPT-5.4 and GPT-5.5 increases.

You asked about output tokens. Their price switches too. One thing to note is that this pricing “upgrade” is applied per AI model call. That can be a single call that gets you a response, but it can also be one of many calls by your agentic AI, which keeps building up a conversation session that is re-sent each time. Later automated calls might hit the price increase just from how much internal text the AI has generated, which is now re-sent as input. That can even include internal hosted tools on the Responses API, which can re-call the model over and over with the results of tool calls.
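To make the agentic case concrete, here is a toy simulation of a session whose re-sent input grows on every call. Everything in it is hypothetical (the function name, the fixed growth per call); the only figure taken from the discussion is the 272k threshold.

```python
from typing import Optional

LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, from the pricing note above

def first_long_context_call(initial_input: int, tokens_added_per_call: int,
                            max_calls: int) -> Optional[int]:
    """Illustrative only: find the first call whose input exceeds the threshold.

    Assumes each call re-sends the whole conversation, and that each turn
    adds a fixed number of tokens (generated text plus tool results).
    Returns the 1-based call number, or None if the threshold is never crossed.
    """
    input_tokens = initial_input
    for call_number in range(1, max_calls + 1):
        if input_tokens > LONG_CONTEXT_THRESHOLD:
            return call_number
        input_tokens += tokens_added_per_call
    return None

# Start at 200k input, grow by 20k per turn: calls 1-4 send 200k..260k,
# and call 5 sends 280k, which is the first one over 272k.
print(first_long_context_call(200_000, 20_000, max_calls=10))  # 5
```

The point of the sketch is that no single user message has to be large. The accumulated session is what gets measured on each call.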

So: you can deterministically know the cost of your call by measuring tokens. The input price is 100% higher and the output price is 50% higher if your input exceeds the standard 272k.
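A minimal sketch of that calculation follows. The per-million prices are placeholders you pass in, not real rates, and the function name is made up. It also assumes the higher rate applies to every token in the call once the input crosses the threshold, and it ignores cached-input discounts.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens
INPUT_MULTIPLIER = 2.0            # "100% more" for input
OUTPUT_MULTIPLIER = 1.5           # "50% more" for output

def estimate_call_cost(input_tokens: int, output_tokens: int,
                       input_price_per_million: float,
                       output_price_per_million: float) -> float:
    """Illustrative only: estimate one call's cost under the pricing switch.

    The trigger is the input size alone; output length never flips the
    switch, but output is billed at the higher rate once input has.
    """
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = input_price_per_million * (INPUT_MULTIPLIER if long_context else 1.0)
    output_rate = output_price_per_million * (OUTPUT_MULTIPLIER if long_context else 1.0)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Placeholder prices: 1.00 per million input tokens, 10.00 per million output tokens
short_cost = estimate_call_cost(100_000, 1_000, 1.0, 10.0)
long_cost = estimate_call_cost(300_000, 1_000, 1.0, 10.0)
print(short_cost, long_cost)
```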

For a final authority, though one still not worded as clearly as might be desired and not stated where the answer belongs, let’s turn to “introducing gpt-5.4 on the API”:

GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.

In the API, GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks. Batch and Flex pricing are available at half the standard API rate, while Priority processing is available at twice the standard API rate.

At least the “switch” point is documented.

I hope that covers everything you need to understand about when the increased pricing applies.