Question about CompletionUsage error with structured outputs

Look at the completion token count reported in the quoted usage output. The gpt-4.1-mini model has gone into a loop and likely dumped a long run of tabs or carriage-return/linefeed characters. When well behaved and not off the rails, this model simply won’t write anything approaching 2k tokens for a task like this anyway.

This much I can infer, or guess where it is not mentioned:

  • using the Chat Completions endpoint.
  • using the parse() method, sending a Pydantic BaseModel as the response_format parameter
  • using the parsed object in the response.
  • initially, no max_completion_tokens set to limit your cost when the AI model breaks and writes far more than a valid response.

This client-side SDK usage will fail if the JSON is never closed. And 30,000 useless characters in one of the strings the AI writes makes the JSON completely useless to you anyway.
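
For reference, here is a minimal sketch of the pattern I’m assuming above. The schema, prompt, and whether parse() lives under client.beta.chat or client.chat depend on your code and SDK version; the names are illustrative only:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Illustrative schema; yours will differ
class Extraction(BaseModel):
    title: str
    summary: str

completion = client.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Extract the title and summary."}],
    response_format=Extraction,  # Pydantic BaseModel handed to parse()
    # no max_completion_tokens: a runaway loop can bill thousands of tokens
)

# If the model never closes the JSON, the SDK cannot hand you a usable
# parsed object, and everything you paid for is wasted.
result = completion.choices[0].message.parsed
```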

The fault is the AI model gpt-4.1-mini. It has been reported over and over as poor with structured outputs, failing at producing strings in ways that other models only would if you used the older json_object output type, where you merely explain the format to the AI.

Simply: Try gpt-4o-2024-11-20 the same way. Set the top_p parameter to 0.2 for reliable results. Then when you get reliable performance, you can try reducing the costs with gpt-4o-mini.

You can also tweak the output with a frequency_penalty of about 0.1 or so, to break up long loops of the same character. That can’t fix what the AI has already done, though; you still have bad writing.
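
Putting those two suggestions together is just a model swap and two sampling parameters on the same call (same illustrative schema as the sketch above):

```python
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",   # swap the model first
    messages=[{"role": "user", "content": "Extract the title and summary."}],
    response_format=Extraction,
    top_p=0.2,                   # narrow sampling for more reliable structured output
    frequency_penalty=0.1,       # discourages long runs of the same character
)
```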

More extravagantly: don’t use the SDK’s “parse()” method, don’t use Pydantic, and don’t “fail” so harshly. Write a strict schema yourself as JSON for the response_format: json_schema output type, use client.chat.completions.create(), and then read the “content” where the AI has followed the schema. You can then even work with unclosed JSON, stripping the characters you observe the model producing for your particular inputs and task.
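
A sketch of that approach, with a hand-written schema and a salvage step for unclosed JSON; the schema and the cleanup are illustrative, and you should tailor the stripping to the garbage you actually observe:

```python
import json

from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["title", "summary"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "Extract the title and summary."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "extraction", "strict": True, "schema": schema},
    },
)

content = response.choices[0].message.content

# You decide how to fail: try normal parsing, then salvage if needed.
try:
    data = json.loads(content)
except json.JSONDecodeError:
    # Example salvage only: strip the runaway characters you observe in your
    # own outputs (tabs, CR/LF loops) and retry, or pull fields out by hand.
    cleaned = content.replace("\t", "").replace("\r", "").replace("\n", " ")
    data = json.loads(cleaned) if cleaned.rstrip().endswith("}") else None
```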

Overall: gpt-4.1-mini is a lost cause, bad at function calling, bad at structured outputs, and liable to cost you the maximum by writing loops of bad characters into your strict JSON. Set max_completion_tokens to 2500 simply so that the “crazy” cost isn’t the maximum cost.
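
Capping the damage is one parameter on either style of call (reusing the schema from the sketch above); 2500 is just a generous ceiling for a response that should never legitimately come close to it:

```python
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "Extract the title and summary."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "extraction", "strict": True, "schema": schema},
    },
    max_completion_tokens=2500,  # a looping response stops billing here, not at the model maximum
)
```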