If you pass reasoning_effort: "none", it works as expected (0 reasoning tokens). But if you also pass max_completion_tokens in the same payload, the API ignores the reasoning setting, falls back to reasoning anyway, burns through the entire max_completion_tokens budget on invisible reasoning tokens, and returns an empty string with finish_reason: "length".
Repro:
```javascript
const OpenAI = require("openai");

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function testBug() {
  console.log("Testing with max_completion_tokens...");
  // FAILS: ignores "none", generates 100 reasoning tokens, and returns ""
  const buggyResponse = await openai.chat.completions.create({
    model: "gpt-5.4-2026-03-05",
    messages: [{ role: "user", content: "Explain the theory of relativity." }],
    reasoning_effort: "none",
    max_completion_tokens: 100, // <-- this breaks it
  });
  console.log("Buggy Reasoning Tokens:", buggyResponse.usage.completion_tokens_details.reasoning_tokens);
  // Outputs: 100

  console.log("\nTesting WITHOUT max_completion_tokens...");
  // WORKS: 0 reasoning tokens, generates normal text
  const workingResponse = await openai.chat.completions.create({
    model: "gpt-5.4-2026-03-05",
    messages: [{ role: "user", content: "Explain the theory of relativity." }],
    reasoning_effort: "none",
    // max_completion_tokens omitted entirely
  });
  console.log("Working Reasoning Tokens:", workingResponse.usage.completion_tokens_details.reasoning_tokens);
  // Outputs: 0
}

testBug().catch(console.error);
```
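To catch this symptom in production rather than by eye, a small check on the response object can help. This is a minimal sketch: `looksLikeReasoningTruncation` is a hypothetical helper name, and it assumes the Chat Completions response shape used in the repro (`choices[0].message.content`, `choices[0].finish_reason`, `usage.completion_tokens_details.reasoning_tokens`).

```javascript
// Hypothetical helper: flags a Chat Completions response that matches the
// reported symptom: empty text, finish_reason "length", and the whole
// completion budget reported as reasoning tokens.
function looksLikeReasoningTruncation(response) {
  const choice = response.choices && response.choices[0];
  if (!choice) return false;

  const content = choice.message && choice.message.content;
  const usage = response.usage || {};
  const details = usage.completion_tokens_details || {};
  const reasoningTokens = details.reasoning_tokens || 0;
  const completionTokens = usage.completion_tokens || 0;

  return (
    (content === "" || content == null) &&
    choice.finish_reason === "length" &&
    completionTokens > 0 &&
    reasoningTokens === completionTokens
  );
}
```

In the repro, `looksLikeReasoningTruncation(buggyResponse)` would be the check to log alongside the reasoning-token count.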
The only workaround right now is dropping max_completion_tokens entirely, which obviously isn’t ideal since we lose a hard cap on token costs.
I can confirm: chat.completions behaves correctly with reasoning_effort: "none" on its own, but when max_completion_tokens is added, it ignores "none", spends the whole budget on reasoning tokens, and returns an empty string with finish_reason: "length".
One possible solution is to switch this call to the Responses API and use:
Thanks for confirming. I looked into the Responses API, but unfortunately it doesn’t support n > 1, which my app needs for generating multiple options at once.
For now, I’m just dropping max_completion_tokens entirely from my chat completions call so it doesn’t break. It’s a bummer losing the hard cap on costs though, so hopefully the team patches the main endpoint soon.
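The workaround can be made explicit so the cap is only dropped where the bug bites. A minimal sketch, assuming the Chat Completions parameter names from the repro; `buildChatParams` and its `capTokens` option are hypothetical, and the rule "omit the cap only when reasoning_effort is "none"" is an assumption drawn from the reports in this thread, not documented behaviour.

```javascript
// Hypothetical params builder: keeps max_completion_tokens as a cost cap,
// except when reasoning_effort is "none", where this thread reports that the
// cap triggers the empty-output bug.
function buildChatParams({ model, messages, reasoningEffort, capTokens, n = 1 }) {
  const params = { model, messages, n };
  if (reasoningEffort !== undefined) {
    params.reasoning_effort = reasoningEffort;
  }
  // Workaround: only send the cap when it is not combined with "none".
  if (capTokens !== undefined && reasoningEffort !== "none") {
    params.max_completion_tokens = capTokens;
  }
  return params;
}
```

Passing `n: 3` keeps the multiple-options behaviour that ruled out the Responses API, at the cost of losing the hard cap on "none" calls only.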
The underlying fault is that Chat Completions does not deliver output that is incomplete. All of it is classified as "reasoning" in usage, even when the output clearly would have transitioned into the final visible response.
The second fault is that there actually is reasoning at "none"; it is just hidden behind a 128-token threshold, below which it is not billed.
I’ve posted about this before. Say the model will predictably write about 500 tokens. With max_completion_tokens at 300, 400, or 500, usage reports everything as "reasoning" and a non-streaming call returns no output. Raise the cap further and it eventually switches to the full output instead of nothing: you only get what you paid for once the model reaches the stop sequence and the output is complete.
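The probing described above (raising the cap until output appears) can be scripted. A minimal sketch: `findDeliveryThreshold` and `countAllReasoningProbes` are hypothetical helpers, and each probe record is assumed to be something you collect yourself from non-streaming Chat Completions calls at increasing max_completion_tokens values. No API call is made here.

```javascript
// Each probe records the cap that was sent and what came back:
// { cap, content, reasoningTokens }.
// Returns the smallest cap that produced visible text, or null if none did.
function findDeliveryThreshold(probes) {
  const sorted = [...probes].sort((a, b) => a.cap - b.cap);
  for (const probe of sorted) {
    if (probe.content && probe.content.length > 0) {
      return probe.cap;
    }
  }
  return null;
}

// Counts probes that returned no text and had the entire cap billed as reasoning.
function countAllReasoningProbes(probes) {
  return probes.filter(
    (p) => (!p.content || p.content.length === 0) && p.reasoningTokens === p.cap
  ).length;
}
```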
This symptom has persisted on Chat Completions with reasoning models, with no sign that my reports have had any effect. You pay, and then do not receive the partial output.
Hey everyone, thank you for flagging this behaviour. While the workaround would be to use the Responses API, we have also flagged this to our engineering team, and they are looking into it. Thank you!
It is close to fixed: Chat Completions now returns output. But there is another symptom.
Overbilling/Under-delivery on truncated outputs
First round of testing on gpt-5.4-mini-2026-03-17, with reasoning_effort: none, non-streaming:
Chat Completions (max_completion_tokens: 282): billed 282 non-reasoning tokens; 255 tokens of o200k_base text actually received.
Responses API (max_output_tokens: 282): billed 282 tokens; likewise 255 tokens of o200k_base text actually received.
This means being billed for 27 tokens that were never delivered.
Confirmation of overbilling across models: gpt-5.4-2026-03-05: 256 tokens delivered of 282 billed, with no reasoning billed. Same with gpt-5.2-2025-12-11.
Then at the low end (max_output_tokens: 20): billed 20 tokens, 16 tokens of text output.
And at the high end (max_output_tokens: 1000): billed 1000 tokens, 915 tokens of text output.
That is 85 tokens billed but never delivered.
It does not look like a reasoning charge: the same input should produce the same amount of reasoning, yet the gap changes with the cap. Or, if the model is "still reasoning a hidden amount at none", the threshold is behaving inconsistently.
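The arithmetic behind these figures can be checked directly. This sketch uses only the billed and delivered counts reported above; `summarizeGaps` and `gapIsConstant` are hypothetical helpers. If a fixed amount of hidden reasoning explained the difference, the gap would be roughly the same at every cap; instead it is 4, 27 and 85 tokens.

```javascript
// Billed vs. delivered token counts as reported in the post.
const measurements = [
  { cap: 20, billed: 20, delivered: 16 },
  { cap: 282, billed: 282, delivered: 255 },
  { cap: 1000, billed: 1000, delivered: 915 },
];

// For each measurement: tokens billed but not delivered, and their share of the bill.
function summarizeGaps(rows) {
  return rows.map(({ cap, billed, delivered }) => {
    const missing = billed - delivered;
    return { cap, missing, share: missing / billed };
  });
}

// A fixed hidden-reasoning overhead would give (roughly) the same gap at every cap.
function gapIsConstant(rows, tolerance = 0) {
  const gaps = summarizeGaps(rows).map((r) => r.missing);
  return Math.max(...gaps) - Math.min(...gaps) <= tolerance;
}

for (const row of summarizeGaps(measurements)) {
  console.log(`cap ${row.cap}: ${row.missing} missing (${(row.share * 100).toFixed(1)}%)`);
}
```

The absolute gap grows with the cap while its share of the bill shrinks (20% at 20 tokens, about 9.6% at 282, 8.5% at 1000), which fits neither a fixed overhead nor a fixed percentage.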
Conclusion
Overbilling or under-delivery on truncated outputs. Possible explanations:
Reasoning that still happens at "none" being billed as output rather than as reasoning?
Premature truncation before all generated output is delivered?
Content inspection holding back a large run of undelivered tokens?
The more you receive, the larger the overbilling, or the more that goes missing.
An extra-special note: you cannot replicate this in the playground on the platform site. Neither the Responses nor the Chat Completions playground offers a max-tokens setting for reasoning models, so a model stuck in loops or repeating assistant messages there can cost you several dollars.
Thanks for retesting. This looks like a separate follow-up from the original reasoning_effort: "none" + max_completion_tokens issue.
If reasoning_tokens is now 0 and visible text is returned, the original reasoning-token bug looks fixed. The remaining gap you’re seeing is different: usage reports the full token cap, but the delivered text appears to tokenize to fewer visible tokens.
Since the docs say max_completion_tokens / max_output_tokens include generated output tokens, including reasoning tokens, this is worth checking with the server-side trace rather than guessing from local tokenization.
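The accounting the reply refers to can be written out: output tokens count against the cap, reasoning tokens are part of that count, and the visible share is the difference. A minimal sketch, assuming the usage field names of the two APIs (`completion_tokens` and `completion_tokens_details.reasoning_tokens` for Chat Completions; `output_tokens` and `output_tokens_details.reasoning_tokens` for Responses). `visibleOutputTokens` is a hypothetical helper, and the locally counted figure is something you supply yourself, for example from an o200k_base tokenizer.

```javascript
// Hypothetical helper: normalises a usage block from either API and compares
// the billed visible output with a locally counted token total.
function visibleOutputTokens(usage, locallyCounted) {
  const isResponses = usage.output_tokens !== undefined;
  const billedOutput = isResponses ? usage.output_tokens : usage.completion_tokens;
  const details =
    (isResponses ? usage.output_tokens_details : usage.completion_tokens_details) || {};
  const reasoning = details.reasoning_tokens || 0;

  const billedVisible = billedOutput - reasoning;
  return {
    billedOutput,
    reasoning,
    billedVisible,
    // Positive means more visible tokens were billed than were counted locally.
    unaccounted: billedVisible - locallyCounted,
  };
}
```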
Could you share just one affected request ID and its usage block? That should be enough to verify what was counted versus returned.
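The request ID and usage block asked for here can be pulled straight from the response object. A minimal sketch, assuming a recent openai Node SDK version, which exposes the x-request-id response header as a `_request_id` property on returned objects (check your SDK version); `usageReport` is a hypothetical helper.

```javascript
// Hypothetical helper: collects the fields needed for a support report from a
// Chat Completions or Responses object returned by the openai Node SDK.
function usageReport(response) {
  const choice = response.choices && response.choices[0];
  return JSON.stringify(
    {
      request_id: response._request_id ?? null,
      model: response.model,
      // Chat Completions exposes finish_reason; Responses exposes status instead.
      // Fields that are undefined are dropped by JSON.stringify.
      finish_reason: choice ? choice.finish_reason : undefined,
      status: response.status,
      usage: response.usage,
    },
    null,
    2
  );
}
```

In the original repro, that would be `console.log(usageReport(buggyResponse));`.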