The announcement covers the headlines well. This companion post covers the specifics you’ll want to know before your first API call, so you’re not figuring them out mid-build.
The 1M context window is opt-in (and worth understanding first)
The 1M token context is real, but it’s an experimental feature you enable explicitly by configuring model_context_window and model_auto_compact_token_limit.
Without those params, you’re on the standard 272K window. Requests that go beyond 272K count against usage limits at 2x the normal rate, so it’s worth sizing your workloads intentionally before enabling it.
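As a sketch of what opting in looks like (the key names come straight from the post; treating them as Codex-style config.toml settings, and the specific values, are my assumptions):

```toml
# Hypothetical sketch: opting in to the experimental 1M window.
# Key names are from the post; the file location (Codex-style
# config.toml) and the values are assumptions -- check the docs.
model = "gpt-5.4"
model_context_window = 1000000           # enable the full 1M token window
model_auto_compact_token_limit = 900000  # auto-compact history before hitting the cap
```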
For most tasks, 272K is plenty. The 1M path is there when you genuinely need it.
Pricing — and why token efficiency changes the math
| Model | Input | Cached Input | Output |
|---|---|---|---|
| gpt-5.2 | $1.75 / M | $0.175 / M | $14 / M |
| gpt-5.4 | $2.50 / M | $0.25 / M | $15 / M |
| gpt-5.2-pro | $21 / M | — | $168 / M |
| gpt-5.4-pro | $30 / M | — | $180 / M |
The per-token price is higher, but GPT-5.4 is meaningfully more token-efficient: in OpenAI’s tests on Scale’s MCP Atlas benchmark, tool search alone reduced total token usage by 47%.
Before assuming it’s a cost increase, it’s worth running a comparison on your actual workload. For tool-heavy agents especially, the net cost may surprise you.
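For a back-of-envelope sense of how that nets out, here’s the kind of comparison worth running. The volumes below are made up, and applying the full 47% MCP Atlas reduction to input tokens is an assumption; substitute numbers measured from your own workload.

```python
# Back-of-envelope cost comparison. Volumes are illustrative, and
# applying the full 47% MCP Atlas reduction to input tokens is an
# assumption; measure your own workload instead.

def cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Dollar cost for a workload, with prices given per million tokens."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

tokens_in, tokens_out = 40_000_000, 2_000_000  # hypothetical monthly volume

gpt_5_2 = cost(tokens_in, tokens_out, 1.75, 14.00)
gpt_5_4 = cost(int(tokens_in * 0.53), tokens_out, 2.50, 15.00)

print(f"GPT-5.2: ${gpt_5_2:.2f}, GPT-5.4: ${gpt_5_4:.2f}")
# -> GPT-5.2: $98.00, GPT-5.4: $83.00
```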
Batch and Flex processing remain available at half the standard rate. Priority processing (the API equivalent of /fast mode in Codex) is available at 2x.
Tool search: how to set it up
Tool search is one of the most impactful new capabilities for agent builders, and it requires explicit setup rather than being on by default.
Instead of loading all tool definitions into the prompt upfront, the model receives a lightweight tool list and fetches definitions on demand. The result: smaller prompts, preserved cache, and the ability to work across much larger tool ecosystems.
The 47% token reduction number comes from running all 36 MCP servers in Scale’s MCP Atlas benchmark. For MCP-heavy setups, the efficiency gains are substantial. See the tool search guide for setup details.
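The guide has the real interface; purely as a shape sketch of the pattern (the commented-out flag name below is my assumption, not a documented parameter):

```python
# Shape sketch only: many MCP servers attached, with tool definitions
# fetched on demand rather than inlined into the prompt. The enabling
# flag is commented out because its real name lives in the tool search guide.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5.4",
    tools=[
        {"type": "mcp", "server_label": "github", "server_url": "https://example.com/github-mcp"},
        {"type": "mcp", "server_label": "linear", "server_url": "https://example.com/linear-mcp"},
        # ...the rest of your servers...
    ],
    # tool_search="auto",  # hypothetical flag; see the tool search guide
    input="Find open PRs that touch the billing service and file a ticket.",
)
print(resp.output_text)
```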
Computer use: get your image detail level right
First, the relevant line from the announcement:

> GPT‑5.4 is our first general-purpose model with native computer-use capabilities.
OpenAI introduced a new original image input detail level alongside changes to high:
- original (new): up to 10.24M pixels, 6000px max dimension
- high (updated): up to 2.56M pixels, 2048px max dimension
Early testing showed strong gains in localization and click accuracy when using original or high. If you’re building computer use agents, setting the right detail level upfront will meaningfully affect your results — it’s one of the first things to dial in.
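A minimal sketch of passing a screenshot at the new detail level, assuming original slots into the existing detail field the way low/high/auto do today:

```python
# Minimal sketch: sending a screenshot for a computer-use step.
# Treating "original" as a value for the existing "detail" field
# is an assumption based on the post.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5.4",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Click the Save button."},
            {
                "type": "input_image",
                "image_url": "data:image/png;base64,<screenshot-bytes>",
                "detail": "original",  # or "high"; both showed localization gains
            },
        ],
    }],
)
print(resp.output_text)
```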
Reasoning effort levels and what the benchmarks actually reflect
Most benchmark numbers in the announcement were measured at reasoning_effort=xhigh. Performance at none looks different — though GPT-5.4 at none still outperforms GPT-5.2 on latency-sensitive tasks like τ²-bench Telecom (64.3% vs 57.2%).
For production workloads, it’s worth benchmarking at the reasoning effort you’ll actually use rather than defaulting to xhigh everywhere. The model is efficient enough at lower effort levels that you may not need to reach for xhigh as often as with previous models.
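Something like the following, with the none value taken from the post (the exact set of accepted values may differ by endpoint, so verify against the API reference):

```python
# Sketch: pinning reasoning effort to what production will actually use,
# then benchmarking at that setting. "none" is the value named in the
# post; check which values your endpoint accepts.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "none"},  # benchmark here, not at xhigh, if this is what ships
    input="Classify this support ticket: 'My roaming data stopped working abroad.'",
)
print(resp.output_text)
```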
Fast mode
When toggled on, /fast mode in Codex delivers up to 1.5x faster token velocity with GPT‑5.4. It’s the same model and the same intelligence, just served faster, so you can move through coding tasks, iteration, and debugging while staying in flow. Via the API, the same speeds are available through priority processing.
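In API terms that’s the service_tier parameter, assuming GPT-5.4 keeps the existing values:

```python
# Sketch: requesting priority processing, the API-side analogue of /fast.
# service_tier is the existing OpenAI parameter; "priority" is the 2x
# tier, "flex" the half-rate tier mentioned in the pricing section.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5.4",
    service_tier="priority",  # or "flex" for half-rate, latency-tolerant jobs
    input="Refactor this function to avoid the N+1 query pattern.",
)
print(resp.output_text)
```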
ChatGPT plan rollout details
- Plus, Team, Pro: GPT-5.4 Thinking available today, replaces GPT-5.2 Thinking
- Enterprise and Edu: early access via admin settings — needs to be manually enabled
- GPT-5.2 Thinking: moves to Legacy Models, retires June 5, 2026
- GPT-5.4 Pro: available on Pro and Enterprise plans
Context windows in ChatGPT for GPT-5.4 Thinking are unchanged from GPT-5.2 Thinking.
The benchmark worth bookmarking
OSWorld-Verified tests desktop navigation via screenshots + mouse/keyboard actions:
- GPT-5.4: 75.0%
- Human: 72.4%
- GPT-5.2: 47.3%
That jump from 47% to 75%, past human level, is a meaningful signal for anyone building computer use agents. It’s the number that most changes what’s now reasonable to attempt.
Curious what workloads folks here are planning to test first. To me, tool search and computer use feel like the areas with the most unexplored surface right now.