Length Finish Reason Error despite not exceeding completion limit

Hello there,

I’m experiencing sporadic errors with the beta.chat.completions.parse endpoint when using gpt-4o-mini-2024-07-18 with structured outputs (openai v1.59.7, Python).

My setup:

  • Temperature: 0.0
  • System message: ~1000 tokens
  • User message: ~5000 tokens (including few-shot examples to guide the model)
  • Pydantic response format (note this is a dummy example, not the names of the real fields I’m using):

    from pydantic import BaseModel

    class OutputFormat(BaseModel):
        output1: bool
        output2: bool
        output3: str

  • Expected output size: 100-250 tokens
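
Roughly, the call looks like this (a sketch with prompts abridged; system_prompt and user_message stand in for my real prompts):

    from openai import OpenAI

    client = OpenAI()

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.0,
        messages=[
            {"role": "system", "content": system_prompt},  # ~1000 tokens
            {"role": "user", "content": user_message},     # ~5000 tokens, incl. few-shot examples
        ],
        response_format=OutputFormat,
    )
    result = completion.choices[0].message.parsed  # an OutputFormat instance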

The few-shot examples I provide show the output format as something like:

{
    "output1": true,
    "output2": false,
    "output3": "<some text>"
}

This works the majority of the time. However, very sporadically, I have started to observe an error:

raise LengthFinishReasonError(completion=chat_completion)
openai.LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=5, prompt_tokens=6116, total_tokens=6121, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=6016))

The error is very difficult to replicate; it generally appears only under high load. I’ve also noticed that only 5 completion tokens are generated in these error cases, which appears to be just the beginning of the structured output, up to and including the “output1:” text.

Does anyone have ideas about what could be causing this error? How could I even diagnose the output?

If you are getting an error raised locally by the parse() method, which tries to add a parsed key to the normal response object alongside “content”, the first question I would ask is: what finish_reason is being returned?

  • stop: a stop token sequence was produced by the model
  • length: max_tokens or max_completion_tokens was hit.
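
You can catch the exception and inspect this for yourself: the exception carries the raw completion (recent openai-python versions store it on a completion attribute). A minimal sketch, assuming the OutputFormat model and messages from the question:

    from openai import OpenAI, LengthFinishReasonError

    client = OpenAI()

    try:
        completion = client.beta.chat.completions.parse(
            model="gpt-4o-mini-2024-07-18",
            temperature=0.0,
            messages=messages,
            response_format=OutputFormat,
        )
    except LengthFinishReasonError as e:
        choice = e.completion.choices[0]
        print(choice.finish_reason)          # "length" in the failure case
        print(repr(choice.message.content))  # whatever truncated text was produced
        print(e.completion.usage)            # token counts, including cached_tokens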

Since max_completion_tokens is newer than the model and has to be translated for that API call, I would always send an explicit max_tokens parameter instead, in case that conversion is sometimes skipped or done incorrectly.
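
For example (the max_tokens value here is arbitrary; give it generous headroom above your expected 100-250 token output):

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.0,
        messages=messages,
        response_format=OutputFormat,
        max_tokens=1024,  # explicit older parameter, instead of relying on max_completion_tokens translation
    )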

Then, you can improve the framing of the class a bit by giving it a more descriptive main name, like “mandatory JSON output format schema”, since nothing in the system prompt where the schema is placed actually says that.
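
Something like this, as a hypothetical rename of the dummy model from the question (the class name is used as the schema’s name when the SDK builds the json_schema response format):

    from pydantic import BaseModel

    class MandatoryJsonOutputFormatSchema(BaseModel):
        output1: bool
        output2: bool
        output3: str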

The mini model gets worse, not better, the longer the input context grows in “turns”, so I would not send unnecessary multi-shot “chat”. Instead, put the examples in the system prompt itself, where they would actually appear before the “# Response formats” section that is injected with the schema at the end of the system prompt.
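
A sketch of that layout, with placeholders standing in for real instructions and examples:

    system_prompt = """<your task instructions>

    # Examples
    Input: <example input>
    Output: {"output1": true, "output2": false, "output3": "<example text>"}
    """  # “# Response formats” plus the schema then gets injected after this

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]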