O4-mini returns an empty response because reasoning tokens used the entire completion budget

Hi, this is Lifan from Aissist. I’ve noticed that when using O4-mini, there’s a small but recurring issue where the response is empty and the finish_reason is length.

In the example below, I set the max completion tokens to 3072. However, the model used all 3072 tokens as reasoning tokens, leaving none for actual content generation. I initially had the limit set to 2048 and observed the same issue, so I increased it to 3072, but it is still happening. I also set the reasoning effort to low, and sometimes retrying the same request resolves the issue, but not always.

Does anyone know why this is occurring, or if there’s a way to prevent all tokens from being consumed purely for reasoning?

ChatCompletion(id='chatcmpl-CHXjJdaUN3ahZBpet3wPedM7ZtSRe', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None, annotations=[]), content_filter_results={})], created=1758297269, model='o4-mini-2025-04-16', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=3072, prompt_tokens=10766, total_tokens=13838, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=3072, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)), prompt_filter_results=[{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}])

The API’s max_tokens parameter was renamed to max_completion_tokens for just this reason.

The rename signals that you are no longer specifying the maximum length of the visible response you want to receive. Instead, you are specifying the maximum budget you are willing to pay for output tokens, which covers both the visible output and the internal reasoning that is billed as output.
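The response object already reports how that budget was split. Here is a minimal sketch of reading the breakdown from a Chat Completions response with the openai Python SDK; `resp` stands in for whatever your own create() call returned:

```python
# Minimal sketch: inspect how the completion budget was spent.
# "resp" is assumed to be the object returned by
# client.chat.completions.create(...) with the openai Python SDK.
usage = resp.usage
reasoning = usage.completion_tokens_details.reasoning_tokens
visible = usage.completion_tokens - reasoning

print(f"completion tokens billed: {usage.completion_tokens}")
print(f"  reasoning tokens:       {reasoning}")
print(f"  visible output tokens:  {visible}")

# The failure mode in the post above: the whole budget went to reasoning.
if resp.choices[0].finish_reason == "length" and not resp.choices[0].message.content:
    print("Budget exhausted by reasoning before any visible output was produced.")
```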

If you want to reduce the thinking a bit, use the "reasoning_effort" parameter on Chat Completions. This tells the model, in effect, "don't think so hard before responding."

If you send max_completion_tokens at all, it is better to set it high, for example 30000, so that it only serves to prevent runaway token generation rather than cutting off normal responses.
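As a minimal sketch of that combination (assuming the openai Python SDK; the model name, prompt, and the 30000 budget are just illustrative):

```python
# Minimal sketch: low reasoning effort plus a generous completion budget.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="low",        # ask the model to spend fewer reasoning tokens
    max_completion_tokens=30000,   # generous budget: reasoning + visible output
    messages=[{"role": "user", "content": "Summarize the ticket in two sentences."}],
)

print(resp.choices[0].finish_reason)   # should be "stop" rather than "length"
print(resp.choices[0].message.content)
```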

Hi, we are using Chat Completions, and we are already using max_completion_tokens. Here are the parameters we used for the request:
'escalation_strategy': 'neutral', 'temperature': 0.6, 'top_p': 0.3, 'frequency_penalty': 0.6, 'presence_penalty': 0.6, 'max_completion_tokens': 3072, 'reasoning_effort': 'low'