What is going on with the GPT-5 API?

Internal reasoning also consumes the max_completion_tokens budget. That parameter is you telling the endpoint when to cut the model off.

Try setting that to 20000 or not sending it.

20k should be plenty to get a response after the thinking. If you still get a finish_reason of “length”, the reasoning ran far longer than normal, meaning the model got caught in a thinking loop or hit some other error.
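As a sketch of that check: the helper below flags a response where the whole budget was burned on reasoning and nothing visible came back. Field names follow the Chat Completions response shape; the sample dict is hand-built for illustration, not a real API response.

```python
# Sketch: detect when the entire max_completion_tokens budget was spent on
# internal reasoning, leaving no visible output.

def burned_on_reasoning(resp: dict) -> bool:
    """True when the model stopped on 'length' with no content, having
    spent tokens on reasoning (per usage.completion_tokens_details)."""
    choice = resp["choices"][0]
    details = resp.get("usage", {}).get("completion_tokens_details", {})
    return (choice["finish_reason"] == "length"
            and not choice["message"].get("content")
            and details.get("reasoning_tokens", 0) > 0)

# Illustrative response where reasoning consumed everything:
resp = {
    "choices": [{"finish_reason": "length",
                 "message": {"role": "assistant", "content": ""}}],
    "usage": {"completion_tokens": 2000,
              "completion_tokens_details": {"reasoning_tokens": 2000}},
}
print(burned_on_reasoning(resp))  # True
```

When this returns True, raising max_completion_tokens (or omitting it) is the fix being suggested above.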

With all due respect, you realize how dumb this is, right? The whole reason for capping tokens is to avoid spending 20,000 tokens per response. The parameter is there; it should work and produce an output (even if truncated), as every GPT model has in the past. Full stop.

gpt-5-nano takes, embarrassingly, 2,000 tokens to produce a list of 6 phrases at default reasoning. gpt-4.1-nano completes this task, with superior output, in 200 tokens max.

They should simply remove the parameter entirely if the model only works when you set max tokens to 20,000+ for every response.

If reasoning “minimal” produced some form of decent results, it would be fine (as reasoning tokens wouldn’t be taken into account), but it doesn’t.

Unfortunately, OpenAI has stripped away everything GPT-4.1 offered (temperature, frequency penalty, presence penalty) and left us with a generic, token-hungry model called GPT-5, whose outputs are not much better than its predecessor’s.

As a concession, they give us the reasoning parameter, which, when set to “minimal”, outputs hilariously bad content. This isn’t a frontier model; it’s a step backwards (the API at least) for everything except maybe coding.

This is why so many people wanted 4o back, and why gpt-5-nano is a worthless model whose existence is pointless.


Reasoning models documentation:

If the generated tokens reach the context window limit or the max_output_tokens value you’ve set, you’ll receive a response with a status of incomplete and incomplete_details with reason set to max_output_tokens. This might occur before any visible output tokens are produced, meaning you could incur costs for input and reasoning tokens without receiving a visible response.

To prevent this, ensure there’s sufficient space in the context window or adjust the max_output_tokens value to a higher number. OpenAI recommends reserving at least 25,000 tokens for reasoning and outputs when you start experimenting with these models. As you become familiar with the number of reasoning tokens your prompts require, you can adjust this buffer accordingly.
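That “raise the budget” advice from the docs can be sketched as a small retry policy. The two helpers below are hypothetical; the `status` and `incomplete_details` fields follow the quoted Responses API documentation, and the 25,000 ceiling is the reserve OpenAI recommends above.

```python
def output_budget_exhausted(resp: dict) -> bool:
    """True when a Responses API result stopped because max_output_tokens was hit."""
    return (resp.get("status") == "incomplete"
            and resp.get("incomplete_details", {}).get("reason") == "max_output_tokens")

def next_budget(current: int, ceiling: int = 25_000) -> int:
    """Double the max_output_tokens budget, capped at the recommended 25k reserve."""
    return min(current * 2, ceiling)

# e.g. a request that ran out at 4,000 tokens would be retried at 8,000:
print(next_budget(4000))  # 8000
```

On `output_budget_exhausted(...)` returning True, you would re-issue the request with `max_output_tokens=next_budget(...)`.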

Yes, so we need to spend potentially 25,000 tokens for something gpt-4.1-nano could do for 2,000 tokens. If gpt-4.1-nano doesn’t need to spend so many reasoning tokens to produce a superior output, why does gpt-5-nano? Furthermore, why is gpt-5-nano inferior to even gpt-3.5-turbo with reasoning effort set to minimal? Make it make sense.

For example, if you tell gpt-5-nano to output a response in another language with reasoning minimal (and a prompt that 4.1-nano can handle with ease), it will often leave random words in English. I literally do not understand how it’s possible for this model to be so bad.

It is not like gpt-4.1-nano is going away tomorrow, give it some time for things to settle.

If and when it goes away, it will probably be informed in the deprecations page with some prior notice and by then, hopefully we will probably have a better solution.

You allow no reasoning on gpt-5-nano, which is powered by reasoning? You get a model with half-price input compared to gpt-4.1-nano, and no place to exhibit its power: reasoning and deliberation on the cheap.

Reasoning: 25000 tokens for a penny.

This is score versus allowed accumulated cost while reasoning: independent benchmarking run by the SWE-bench folks using the swe-bench mini agent.


Although these are specialized, agentic tasks, you can see that nano gives higher performance for the cost when your cost cap is extremely low (and not by shutting off its output tokens…).


Benchmarks are all but useless for most real-world applications. The fact is that gpt-4.1-nano produces better results than gpt-5-nano with 10x fewer tokens. That’s in addition to not having the temperature, frequency penalty, and presence penalty parameters. This model has virtually no use case so far as I can tell.

And the fact is that gpt-4.1-nano will be going away.


Hopefully not for a long time; they still allow API submissions to ancient models like gpt-3.5-turbo, so we should get a good few years out of 4.1-nano. We can only hope that by then they’ll either have a new non-reasoning model (by popular demand), or that reasoning effort “minimal” on whatever model is current is as good as 4.1-nano.

10x fewer tokens also means (almost) 10x less time. Time is never considered in those benchmarks.

Yes, 4.1 nano will eventually be deprecated, but there’s no sign that this will happen anytime soon.

GPT-5 isn’t just a model—it’s also part of a system designed to give ChatGPT users a simpler experience. Behind the scenes, a router decides which model to use for a given prompt.

For developers, it’s usually better to pick the right model directly rather than rely on a system that swaps models based on unknown criteria. That’s why I see no need to move to GPT-5 in the API unless it’s clearly better for the specific use case. If it’s not, the legacy models are still available.

I understand customers will ask when they get access to the new model, and we’ll need to manage that messaging. And while the sudden removal of legacy models from ChatGPT was a surprise, it’s important to remember there’s a clear distinction between the consumer product and the API services.

The API does not seem very stable. Sometimes it takes 3 minutes, sometimes 6, for the same number of output tokens. Sometimes I get a 529 gateway error from exactly the same call. Sometimes the output is just empty (see below) but tokens are still consumed. My code has been unchanged for the week since GPT-5 rolled out.

I’ve stopped generating output with gpt-5 today; it’s just too unstable. Half of the time I get no output. Earlier this week it worked (mostly) fine.


I am getting empty responses from both the GPT OSS models as well. The inference is from Fireworks.

And it calls only one tool at a time. Is there a way to control this behavior?


GPT-5 Empty Response Hack

Both max_tokens and max_completion_tokens limit the combined total of reasoning and response tokens. Prefer using max_completion_tokens. We get an empty response if all available tokens are consumed during reasoning. This happens more than usual so be careful!

We can only set reasoning effort to minimal, low, medium, or high. In my testing, minimal appears to disable reasoning entirely. It’s best to set it to low for now, because with medium the model generates thousands of reasoning tokens, making it unreliable due to token exhaustion.

We still need to set max_tokens high enough to allow for occasional reasoning. I’m currently setting it to 4,000.

Also, set parallel_tool_calls to true for better performance when calling tools in parallel.
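Put together, the settings described in this post look roughly like the request below. This is a sketch, not a definitive configuration: the model name and values are taken from the post itself, and the kwargs would be passed to `client.chat.completions.create(**kwargs)`.

```python
# Sketch of the "empty response hack" settings from the post above:
# low reasoning effort, a 4,000-token combined budget, parallel tool calls on.
kwargs = {
    "model": "gpt-5",
    "reasoning_effort": "low",      # minimal/low/medium/high; medium can exhaust the budget
    "max_completion_tokens": 4000,  # limits reasoning + visible output combined
    "parallel_tool_calls": True,    # let the model call tools in parallel
}
```

If responses still come back empty, the post’s advice is to raise `max_completion_tokens` rather than the effort level.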


Yep, thought it might be me but glad to know I’m not the only one.


Are you using the reasoning parameter? I used to call client.chat.completions.create, and all my tokens were being consumed by reasoning. After switching to client.responses.create with the reasoning parameter, it seems to work much better.

You can check “GPT-5 New Params and Tools”, section “4. Minimal Reasoning” (I can’t include the link).
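For comparison, the two call shapes differ like this. These are illustrative kwargs dicts only, not full working calls; the parameter shapes follow the two endpoints as described in this thread.

```python
# Chat Completions takes a flat reasoning_effort string...
chat_kwargs = {
    "model": "gpt-5",
    "reasoning_effort": "minimal",
    "messages": [{"role": "user", "content": "Hello"}],
}

# ...while the Responses API nests the effort under a reasoning object.
responses_kwargs = {
    "model": "gpt-5",
    "reasoning": {"effort": "minimal"},
    "input": "Hello",
}
```

You would pass the first to `client.chat.completions.create(**chat_kwargs)` and the second to `client.responses.create(**responses_kwargs)`.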

Do you have the latest Chat Completions API reference?

reasoning_effort="medium" is the default base API parameter, taking “minimal”, “low”, “medium”, or “high”.

Same. It’s incredibly frustrating and only with GPT-5 models.