Length Finish Reason Error despite not exceeding completion limit

Hello there,

I’m experiencing sporadic errors with the beta.chat.completions.parse endpoint when using gpt-4o-mini-2024-07-18 with structured outputs (openai v1.59.7, Python).

My setup:

  • Temperature: 0.0
  • System message: ~1000 tokens
  • User message: ~5000 tokens (including few-shot examples to guide the model)
  • Pydantic response format (note this is a dummy example, not the names of the real fields I’m using):

    from pydantic import BaseModel

    class OutputFormat(BaseModel):
        output1: bool
        output2: bool
        output3: str

  • Expected output size: 100-250 tokens
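
Roughly, the call looks like this (a sketch with prompts abridged; system_prompt and user_message stand in for my real prompts):

    from openai import OpenAI

    client = OpenAI()

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.0,
        messages=[
            {"role": "system", "content": system_prompt},  # ~1000 tokens
            {"role": "user", "content": user_message},     # ~5000 tokens, incl. few-shot examples
        ],
        response_format=OutputFormat,
    )
    result = completion.choices[0].message.parsed  # an OutputFormat instance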

The few-shot examples I provide show the output format as something like:

{
    "output1": true,
    "output2": false,
    "output3": "<some text>"
}

This works the majority of the time. However, very sporadically, I have started to observe an error:

raise LengthFinishReasonError(completion=chat_completion)
openai.LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=5, prompt_tokens=6116, total_tokens=6121, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=6016))

The error is very difficult to replicate; it generally appears only under high load. I’ve also noticed that only 5 completion tokens are generated in these error cases, which appears to be just the beginning of the structured output, up to and including the “output1:” text.

Does anyone have ideas about what could be causing this error? How could I even diagnose the output?

If you are getting an error raised locally by the parse() method, which tries to add a parsed key to the normal response object alongside “content”, the first question I would ask is: what finish_reason is being returned?

  • stop: a stop token sequence was produced by the model
  • length: max_tokens or max_completion_tokens was hit.
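
You can catch the exception and inspect this for yourself: the exception carries the raw completion (recent openai-python versions store it on a completion attribute). A minimal sketch, assuming the OutputFormat model and messages from the question:

    from openai import OpenAI, LengthFinishReasonError

    client = OpenAI()

    try:
        completion = client.beta.chat.completions.parse(
            model="gpt-4o-mini-2024-07-18",
            temperature=0.0,
            messages=messages,
            response_format=OutputFormat,
        )
    except LengthFinishReasonError as e:
        choice = e.completion.choices[0]
        print(choice.finish_reason)          # "length" in the failure case
        print(repr(choice.message.content))  # whatever truncated text was produced
        print(e.completion.usage)            # token counts, including cached_tokens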

Since max_completion_tokens is newer than the model and has to be translated for that API call, I would always send an explicit max_tokens parameter instead, in case that conversion is sometimes skipped or done incorrectly.
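
For example (the max_tokens value here is arbitrary; give it generous headroom above your expected 100-250 token output):

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.0,
        messages=messages,
        response_format=OutputFormat,
        max_tokens=1024,  # explicit older parameter, instead of relying on max_completion_tokens translation
    )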

Then, you can improve the framing of the class a bit by giving it a more descriptive main name, like “mandatory JSON output format schema”, since nothing in the system prompt where the schema is placed actually says that.
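
Something like this, as a hypothetical rename of the dummy model from the question (the class name is used as the schema’s name when the SDK builds the json_schema response format):

    from pydantic import BaseModel

    class MandatoryJsonOutputFormatSchema(BaseModel):
        output1: bool
        output2: bool
        output3: str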

The mini model gets worse, not better, the longer the input context grows in “turns”, so I would not send unnecessary multi-shot “chat”. Instead, put the examples in the system prompt itself, where they would actually appear before the “# Response formats” section that is injected with the schema at the end of the system prompt.
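
A sketch of that layout, with placeholders standing in for real instructions and examples:

    system_prompt = """<your task instructions>

    # Examples
    Input: <example input>
    Output: {"output1": true, "output2": false, "output3": "<example text>"}
    """  # “# Response formats” plus the schema then gets injected after this

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]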