Question about CompletionUsage error with structured outputs

Hello all,
I am running into a situation that I would love to troubleshoot. I have found many similar posts about this topic, but I have struggled to apply them to my situation, so thank you in advance for any help you can give me!

Note: I am calling gpt-4.1-mini, and the company I work for has a rate limit of 30,000 tokens per minute.

I am currently using the OpenAI API in the following way: I have a bunch of .txt files that I read in, insert into a prompt, and send to the model one at a time to get a structured output. The .txt files range in size but generally are not very big at all; in terms of tokens, the larger files combined with the prompt contain at most ~6,000 tokens, and some are as small as ~3,000. Most of the time this works fine. However, after running a decent number of files, I started to run into the following error somewhat randomly:

Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=32768, prompt_tokens=5141, total_tokens=37909, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=2048))

One thing to note: in my code, if this error pops up, the code sleeps for 30 seconds and then tries that file again, re-running the same file a maximum of 5 times before moving on (a simplified sketch of this retry loop is shown below). Sometimes re-running the file fixes the issue and the file gets processed, but other times it fails all 5 attempts without ever returning a valid output. For example, one of the times I was trying to recreate this error, the same file that caused the CompletionUsage error above ran successfully after 3 attempts and had the following usage information, but this doesn't happen every time:

completion.usage {'completion_tokens': 916, 'prompt_tokens': 5141, 'total_tokens': 6057, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 5120}}

What is also strange is that, as you can see, the completion token count is drastically lower when the same file runs successfully. I tried setting max_completion_tokens to 23,000, but the same error occurred, just now at the new limit. This error is both consistent and inconsistent: the same files will throw the CompletionUsage error, yet sometimes those same files won't cause any issues at all. And even when they do throw the error, re-running sometimes fixes it and sometimes doesn't. From what I can tell, there is nothing strange or different about the files that sometimes cause these issues versus the ones that don't; as I said, these files are not big, and I have compared the files that do and don't fail.
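
To make the retry behavior concrete, here is a simplified sketch of that loop; call_openai_with_structured_output is a placeholder standing in for my actual parse() call:

import time

MAX_ATTEMPTS = 5

def process_file(path: str):
    """Simplified sketch of the per-file retry loop described above."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_openai_with_structured_output(text)  # placeholder for my real call
        except Exception as e:
            print(f"attempt {attempt} failed for {path}: {e}")
            time.sleep(30)  # wait before retrying the same file
    return None  # give up after 5 attempts and move on to the next file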

What I was really hoping to find was a way to have the API still return an output of some kind when this error occurs, so I can see what the model's output looks like, because clearly there is a problem. Without that information I feel lost, and so far I haven't found a way to do it. If that is not possible, I was wondering if anyone has suggestions as to what could be happening and/or how to troubleshoot it.

edit: I completely forgot to add: I also want to know if anyone can tell me whether we are still getting charged for the tokens when the CompletionUsage error occurs!

Since you appear to have received a Completion object, have you tried looking into the dashboard logs using the completion ID?

Or, in case you don’t have access to the dashboard, you can also retrieve the full completion object using the ID to inspect its raw contents as text.

And yes, you are probably getting billed for it.
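
Roughly like this, as a sketch; it assumes a recent openai Python SDK and that the original request was sent with store=True, so the completion was actually kept server-side and can be fetched by ID:

from openai import OpenAI

client = OpenAI()

completion_id = "chatcmpl-XXXXXXXX"  # placeholder: the ID from your logs or dashboard

# Only works for stored completions (store=True on the original request)
completion = client.chat.completions.retrieve(completion_id)

print(completion.choices[0].finish_reason)           # "length" when the model ran away
print(completion.choices[0].message.content[:500])   # raw (possibly truncated) output text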

Hi, thank you for your response!
I ran my code again, this time printing the completion IDs, and unfortunately, when I run into this CompletionUsage error, there actually is no completion object (as far as I can tell). The only thing the API returns is the “could not parse…” message with the CompletionUsage object, which does not have an ID. This is the snippet of code:

except Exception as e:
    print(f"An unexpected error occurred in completion_openai: {e}")
    try:
        print(f'id of completion: {completion.id}')
    except Exception as e:
        print(f'no completion object: {e}')

which returned the following:

An unexpected error occurred in completion_openai: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=24858, prompt_tokens=5141, total_tokens=29999, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=2048))
no completion object: local variable 'completion' referenced before assignment

**I added max_completion_tokens = 24858 to my request so that the total tokens would stay under our 30,000 tokens-per-minute limit, just to see if that would make a difference… it did not.
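
In case it helps, this is a sketch of what I'm going to try next instead of catching a bare Exception; it assumes the SDK is raising LengthFinishReasonError from parse() here and that the error carries the truncated completion (which may depend on the SDK version), with client, messages, and MyOutputModel standing in for my existing code:

from openai import LengthFinishReasonError

try:
    completion = client.beta.chat.completions.parse(
        model="gpt-4.1-mini",
        messages=messages,               # my existing prompt + file content
        response_format=MyOutputModel,   # my existing Pydantic model
        max_completion_tokens=24858,     # what I currently have set
    )
except LengthFinishReasonError as e:
    # (assumed) the SDK attaches the raw completion to this error, so the runaway
    # output can still be inspected even though it could not be parsed
    raw = e.completion
    print(f"id: {raw.id}, finish_reason: {raw.choices[0].finish_reason}")
    print((raw.choices[0].message.content or "")[:500])  # first chunk of the bad output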

Do you have access to your dashboard logs?

Look at the output consumption reported in the quoted usage. The gpt-4.1-mini model has gone into a loop, likely dumping a never-ending run of tabs or carriage returns/linefeeds. When this model is well behaved and hasn’t gone off the rails, it simply won’t write anything approaching 2k tokens anyway.

This much I can infer and guess where it is not mentioned:

  • using the Chat Completions endpoint.
  • using the parse() method, with a Pydantic BaseModel as the response_format parameter.
  • using the parsed object in the response.
  • initially, no max_completion_tokens to limit your cost when the AI model breaks and writes far more than a valid response.

This client-side SDK usage will fail if the JSON is never closed, and 30,000 junk characters stuffed into one of the strings the AI writes would leave you with a JSON that is useless to you anyway.

The fault is the AI model gpt-4.1-mini. This has been reported over and over: it is poor with structured outputs, failing to produce strings in ways that other models would only fail at when using the older json_object type of output, where you just explain the format to the AI.

Simply: try gpt-4o-2024-11-20 the same way. Set the top_p parameter to 0.2 for reliable results. Then, once you get reliable performance, you can try reducing costs with gpt-4o-mini.

You can tweak the output with a frequency_penalty of about 0.1 or so to break up long loops of the same character. That can’t fix what the AI has already done, though; you still have bad writing.
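
As a rough sketch of what I mean, keeping your existing parse()/Pydantic setup (your client, messages, and BaseModel class are placeholders here):

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",
    messages=messages,                  # your existing prompt + file text
    response_format=MyOutputModel,      # your existing Pydantic BaseModel
    top_p=0.2,                          # clamp sampling for reliable structured output
    frequency_penalty=0.1,              # discourage long runs of the same character
    max_completion_tokens=2500,         # cap the damage if the model still runs away
)
result = completion.choices[0].message.parsed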

More extravagantly: don’t use the SDK’s “parse()” method, don’t use Pydantic, and don’t “fail” so harshly. Write a strict schema yourself as JSON for the response_format json_schema type of output, use client.chat.completions.create(), and then read the “content” where the AI has followed the schema. You can then even work with unclosed JSON, stripping the characters you observe the model producing for your inputs and task.
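
Something along these lines, as a sketch; the two fields are made up, so swap in whatever your real extraction needs:

import json

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},     # placeholder field
        "summary": {"type": "string"},   # placeholder field
    },
    "required": ["title", "summary"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "extraction", "strict": True, "schema": schema},
    },
    max_completion_tokens=2500,
)

choice = completion.choices[0]
content = choice.message.content            # raw text is always available, even when truncated
if choice.finish_reason == "length":
    print("Runaway or truncated output:", content[:500])  # inspect and salvage manually
else:
    data = json.loads(content)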

Overall: gpt-4.1-mini is a lost cause, bad with function calling, bad with structured outputs, and sure to cost you the maximum by often writing loops of bad characters into your strict JSON. Set max_completion_tokens to 2500 simply so that the “crazy” cost isn’t the maximum cost.

Thank you so much for all of the information; you are pretty much spot on with everything. I had no idea that there was a larger issue with gpt 4.1 mini specifically, so I really appreciate that insight! It’s really unfortunate that this is such a widespread problem. It looks as though moving forward we are going to work with 4o mini! Thanks again!
