Hi everyone,
I’m running into an intermittent issue when using the gpt-4o-mini model via Azure OpenAI, and I can’t quite figure out what’s going on.
Context
• I’m analyzing a database and making many API calls in a loop.
• Each request uses a prompt of about 800–900 tokens.
• The expected completion should be very small (around 200–300 tokens maximum).
• I explicitly set max_tokens=4000 (well above what’s needed, but safely under all limits).
• I’m also using the response_format parameter to get a structured output (a stand-in schema is sketched just after this list).
• The Azure deployment has a 30k TPM limit; I’ve also tested with the gpt-4o model (450k TPM limit) and the issue still occurs.
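For context, modelResponse is a Pydantic model passed straight to parse(). Its real fields aren't relevant to the problem, but a hypothetical stand-in (field names invented purely for illustration) looks like this:

from pydantic import BaseModel

# Hypothetical stand-in for the real modelResponse schema; the actual
# field names differ, but the shape is this simple.
class modelResponse(BaseModel):
    table_name: str
    summary: str
    issues: list[str]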
The error
Occasionally, on a random loop iteration, I get this error:
Could not parse response content as the length limit was reached -
CompletionUsage(
    completion_tokens=4000,
    prompt_tokens=874,
    total_tokens=4874,
    completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, …),
    prompt_tokens_details=PromptTokensDetails(…)
)
This is confusing because:
• The total tokens are clearly below any hard context limit (only ~4.8k total).
• max_tokens is set to 4000, yet completion_tokens hit exactly 4000, which means the model kept generating until it was cut off.
• The completion should not exceed 300 tokens anyway.
• The same call can succeed many times and then fail unexpectedly.
It’s tedious to reproduce, because I need to let the loop run until one of the calls randomly triggers the error.
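For what it's worth, wrapping the call at least lets me capture the failing iteration. A rough sketch (rows and analyze_one() are stand-ins for my actual data and call; if I'm reading the openai SDK right, parse() raises LengthFinishReasonError when the completion hits the length limit):

import logging
from openai import LengthFinishReasonError

for i, row in enumerate(rows):
    try:
        result = analyze_one(row)  # stand-in for the parse() call below
    except LengthFinishReasonError as e:
        # e.completion carries the truncated response, so both the usage
        # numbers and the raw unparseable content can be logged.
        logging.error("iteration %d hit the length limit: %s", i, e.completion.usage)
        logging.error("truncated content: %r", e.completion.choices[0].message.content)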
Example code
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        # Note: "type" is a content-part field, not a message field,
        # so it's omitted from these message objects.
        {"role": "system", "content": prompt1},
        {"role": "system", "content": prompt2},
        {"role": "user", "content": str(content)},
    ],
    response_format=modelResponse,
    temperature=0.2,
    top_p=0.9,
    logprobs=True,
    max_tokens=4000,
)
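On the iterations that succeed, I read the result like this (same completion object as above), and the output really is tiny:

choice = completion.choices[0]
print(choice.finish_reason)                # "stop" on the good iterations
print(completion.usage.completion_tokens)  # typically well under 300
result = choice.message.parsed             # validated modelResponse instance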
I'm not sure what else to try. Has anyone seen this, or does anyone know what could make the completion run all the way to max_tokens on a random iteration?
Thanks in advance.