gpt-5.4 ignores reasoning_effort="none" when max_completion_tokens is used

Found a bug with the gpt-5.4 endpoint.

If you pass reasoning_effort: "none" on its own, it works as expected (0 reasoning tokens). But if you also pass max_completion_tokens in the same payload, the API ignores the reasoning setting entirely, falls back to reasoning, burns through your whole max_completion_tokens budget on hidden reasoning tokens, and returns an empty string with finish_reason: "length".

Repro:

const OpenAI = require("openai");
const openai = new OpenAI();

async function testBug() {
  console.log("Testing with max_completion_tokens...");
  
  // FAILS: Ignores 'none', generates 100 reasoning tokens, and returns ""
  const buggyResponse = await openai.chat.completions.create({
    model: "gpt-5.4-2026-03-05",
    messages: [{ role: "user", content: "Explain the theory of relativity." }],
    reasoning_effort: "none",
    max_completion_tokens: 100 // <-- This breaks it
  });
  console.log("Buggy Reasoning Tokens:", buggyResponse.usage.completion_tokens_details.reasoning_tokens); 
  // Outputs: 100

  console.log("\nTesting WITHOUT max_completion_tokens...");

  // WORKS: 0 reasoning tokens, generates normal text
  const workingResponse = await openai.chat.completions.create({
    model: "gpt-5.4-2026-03-05",
    messages: [{ role: "user", content: "Explain the theory of relativity." }],
    reasoning_effort: "none" 
    // max_completion_tokens omitted entirely
  });
  console.log("Working Reasoning Tokens:", workingResponse.usage.completion_tokens_details.reasoning_tokens); 
  // Outputs: 0
}

testBug();

The only workaround right now is dropping max_completion_tokens entirely, which obviously isn’t ideal since we lose a hard cap on token costs.
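
Until this is resolved, it may help to at least detect the symptom so an empty result is not silently treated as a real answer. Below is a minimal Python sketch, assuming the response has been converted to a plain dict (for example with `response.model_dump()` in the Python SDK). The helper name `looks_like_swallowed_output` is hypothetical; the field names are the ones used in the repro above.

```python
def looks_like_swallowed_output(response: dict) -> bool:
    """Return True when a Chat Completions response shows the symptom from this thread:
    finish_reason "length", empty content, and a non-zero reasoning token count."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens") or 0
    content = choice["message"].get("content") or ""
    return (
        choice.get("finish_reason") == "length"
        and content == ""
        and reasoning_tokens > 0
    )
```

A caller could retry without max_completion_tokens, or log the request, whenever this returns True.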

What about max_output_tokens rather than completion, does that work?

Hi, no, it doesn’t work. max_output_tokens returns “BadRequestError 400 Unknown parameter: ‘max_output_tokens’.”

If you meant the old max_tokens parameter, it seems to be no longer supported with this model, as it returns:

BadRequestError 400 Unsupported parameter: ‘max_tokens’ is not supported with this model. Use ‘max_completion_tokens’ instead.

Hi!

Thanks for the clear repro and raising this!

I can confirm:
chat.completions behaves correctly with reasoning_effort: "none" on its own, but once max_completion_tokens is added, it ignores "none", spends the whole budget on reasoning tokens, and returns an empty string with finish_reason: "length".

One possible solution is to switch this call to the Responses API and use:

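from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.4",
    input="Explain cats to dogs.",
    reasoning={"effort": "none"},
    max_output_tokens=100,  # hard cap on generated tokens
)
print(response.output_text)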

That returns normal text with reasoning_tokens: 0.

Thanks for confirming. I looked into the Responses API, but unfortunately it doesn’t support n > 1, which my app needs for generating multiple options at once.

For now, I’m just dropping max_completion_tokens entirely from my chat completions call so it doesn’t break. It’s a bummer losing the hard cap on costs though, so hopefully the team patches the main endpoint soon.
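
If n > 1 is the only blocker, one option might be to approximate it by sending several independent Responses API requests concurrently. The Python sketch below is an assumption-laden illustration, not an official equivalent of n: `generate_options` and `create_one` are hypothetical names, and each request is likely billed for its own input tokens, unlike a single Chat Completions call with n > 1.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_options(create_one: Callable[[], str], n: int) -> List[str]:
    """Call `create_one` n times concurrently and return the results in submission order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(create_one) for _ in range(n)]
        return [future.result() for future in futures]


# Hypothetical usage with the Responses API call quoted earlier in this thread:
#
#   from openai import OpenAI
#   client = OpenAI()
#
#   def create_one() -> str:
#       response = client.responses.create(
#           model="gpt-5.4",
#           input="Explain cats to dogs.",
#           reasoning={"effort": "none"},
#           max_output_tokens=100,
#       )
#       return response.output_text
#
#   options = generate_options(create_one, n=3)
```

This keeps a hard per-request cap through max_output_tokens while still producing several candidates.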

The fault here is that Chat Completions does not deliver output unless it is complete. Everything generated is classified as “reasoning” in usage, even when the model has clearly already moved on to writing the final, visible output.

On top of that, there actually is reasoning at “none”; it is just hidden behind a threshold of 128 tokens, below which it is not billed.

I’ve made posts about this before. Say the AI will quite predictably write 500 tokens. With max_completion_tokens at 300, 400, or 500, everything is counted as “reasoning” and a non-streaming call never returns any output. Raise the limit further and you eventually get the full output instead of nothing: you get what you paid for only once the AI has reached the stop sequence and the output is done.

This symptom has persisted on Chat Completions across reasoning models, with no sign that my reports have had any impact. You pay, and then do not get the partial output.

Hey everyone, thank you for flagging this behaviour. The workaround for now is to use the Responses API; we have also flagged this to our engineering team, and they are looking into it. Thank you!

Hey everyone, we have deployed a fix, and this issue should now be resolved. Thanks!

It is close to fixed: Chat Completions now returns output. But there is now another symptom.

Overbilling/Under-delivery on truncated outputs

First round of testing on gpt-5.4-mini-2026-03-17

  • reasoning_effort: none
  • non-streaming

Chat Completions @ max_completion_tokens: 282

  • billed 282 non-reasoning tokens
  • 255 tokens of o200k_base text actually received

Responses API @ max_output_tokens: 282

  • billed 282 tokens
  • also 255 tokens of o200k_base text actually received

This shows 27 billed tokens that were never delivered.
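
For anyone wanting to reproduce the comparison, here is a rough Python sketch of the measurement: tokenize the delivered text locally and subtract the count from the billed figure in usage. The helper name `delivery_gap` is hypothetical, and the tokenizer is passed in as a callable so the choice of encoder stays an explicit assumption (the figures in this post were counted with o200k_base).

```python
from typing import Callable, Dict, Sequence


def delivery_gap(
    billed_tokens: int,
    delivered_text: str,
    encode: Callable[[str], Sequence[int]],
) -> Dict[str, int]:
    """Compare the billed output tokens with a local token count of the delivered text."""
    delivered = len(encode(delivered_text))
    return {
        "billed": billed_tokens,
        "delivered": delivered,
        "undelivered": billed_tokens - delivered,
    }


# Hypothetical usage, assuming the third-party tiktoken package is installed:
#
#   import tiktoken
#   enc = tiktoken.get_encoding("o200k_base")
#   gap = delivery_gap(
#       response.usage.completion_tokens,
#       response.choices[0].message.content,
#       enc.encode,
#   )
```

A local count is only an estimate of what the server counted, so a small difference on its own would not be conclusive.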


Confirmation of overbilling across models: gpt-5.4-2026-03-05: 256 tokens delivered of 282 billed, with no reasoning billed. Same with gpt-5.2-2025-12-11.


Then go to a minimum amount: max_output_tokens: 20

  • Billed 20 tokens.
  • output of 16 tokens of text.

Or to max_output_tokens: 1000

  • Billed 1000 tokens.
  • output of 915 tokens of text

This shows 85 billed tokens that were never delivered.
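
The arithmetic for the figures reported in this post can be tabulated as follows. This Python snippet is only a worked calculation over the numbers quoted above; the labels are descriptive, and no new measurements are implied.

```python
# (label, billed tokens, delivered tokens) as reported in this post
MEASUREMENTS = [
    ("gpt-5.4-mini-2026-03-17, cap 282", 282, 255),
    ("gpt-5.4-2026-03-05, cap 282", 282, 256),
    ("cap 20", 20, 16),
    ("cap 1000", 1000, 915),
]


def gap_rows(measurements):
    """Return (label, undelivered tokens, undelivered share of billed in percent)."""
    rows = []
    for label, billed, delivered in measurements:
        missing = billed - delivered
        rows.append((label, missing, round(100 * missing / billed, 1)))
    return rows


for label, missing, pct in gap_rows(MEASUREMENTS):
    print(f"{label}: {missing} undelivered ({pct}% of billed)")
```

On these numbers the absolute gap grows with the cap (4, then 26 to 27, then 85 tokens), while the relative share ranges from roughly 8.5% to 20% and does not grow with the cap.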

It does not seem to be a reasoning charge, as the same input should produce the same amount of reasoning regardless of the cap. Or, if the model is “still reasoning a hidden amount at none”, the threshold is wonky.

Conclusion

  • Overbilling or under-delivery on truncated outputs;
  • Reasoning that still happens at “none” not being billed as reasoning, but as output?
  • Premature truncation before all generated output delivered?
  • Content inspection holding back a large run of undelivered tokens?

The more you receive, the larger the overbilling, or the more that goes missing.


An extra-special note: you cannot replicate this in the chat playground on the platform site. Neither Responses nor Chat Completions there offers you a max_tokens setting on a reasoning model, meaning a model going nuts with loops or repeated assistant messages there will cost you several dollars.

Thanks for retesting. This looks like a separate follow-up from the original reasoning_effort: "none" + max_completion_tokens issue.

If reasoning_tokens is now 0 and visible text is returned, the original reasoning-token bug looks fixed. The remaining gap you’re seeing is different: usage reports the full token cap, but the delivered text appears to tokenize to fewer visible tokens.

Since the docs say max_completion_tokens / max_output_tokens include generated output tokens, including reasoning tokens, this is worth checking with the server-side trace rather than guessing from local tokenization.

Could you share just one affected request ID and its usage block? That should be enough to verify what was counted versus returned.
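
In case it is useful, here is a small Python sketch for collecting those two items from an SDK response. It assumes the Python SDK's `_request_id` property on response objects; the helper name `usage_report` is hypothetical.

```python
import json
from typing import Any


def usage_report(response: Any) -> str:
    """Format the request ID and usage block of a response as JSON for a bug report."""
    usage = response.usage
    # SDK models expose model_dump(); fall back to dict() for plain mappings.
    usage_dict = usage.model_dump() if hasattr(usage, "model_dump") else dict(usage)
    report = {
        "request_id": getattr(response, "_request_id", None),
        "usage": usage_dict,
    }
    return json.dumps(report, indent=2)
```

Calling `print(usage_report(response))` right after an affected request should give something that can be pasted into a reply here.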