Structured output calls fail trying to parse response content

From time to time (not 100% reproducible), after a long "thinking" pause, I get an error like the one below while calling the API with structured outputs:

openai.LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16000, prompt_tokens=3501, total_tokens=19501, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=3328))

I can clearly see that the query is about 3501 tokens, of which 3328 are cached.
And I've allowed the completion to be up to 16000 tokens.

Most of the time, queries get an immediate and correct response, but occasionally they fail with this error.
Since I've wrapped my API calls in retry logic, sometimes a query goes through after a few retries, but not always.

Wondering if anyone has seen this behaviour, and what potential solutions/workarounds there might be?


I think I'm facing the same issue.
I am using structured output too, but it fails on some sentences with:
Error: Could not parse response content as the length limit was reached

even though my max_tokens=50 and total_tokens=156.

From the symptoms above, to me personally, it looks like the temperature is too low for a rather complex task/object definition… So the model ends up repeating the same string in the output, which pushes the response past the max size limit, and the whole thing fails.

But it's hard to tell whether my gut is right without looking at the whole request data. Maybe pasting the call log would be a good place to start?

Thank you for looking into this.
Let me explain.

This API call is part of a hybrid context selection strategy for a RAG implementation.
Temperature is set near zero to force the model to choose an answer from the provided list rather than hallucinate one.
Here is the redacted prompt:

    PROMPT_TEMPLATE = """
1. You are an AI agent selecting relevant knowledge base articles for a user inquiry about [REDACTED] app.  
2. [REDACTED] 
3. Use your notepad to enrich the given user inquiry by adding synonyms for key terms, leveraging your knowledge of the [REDACTED]. Maintain the original inquiry while adding these synonyms to make the inquiry more detailed and explicit.
4. Review the enriched inquiry: [ENRICHED INQUIRY HERE] 
5. Review given list of available titles: {all_titles}  
6. Use your knowledge about [REDACTED], and common user issues to determine which titles could include the necessary information to answer the enriched inquiry.  
7. Select exclusively from list above only those titles that are highly relevant to answering the enriched inquiry, strictly prioritizing titles that match key terms or concepts.
8. If no specific titles match, fall back to general titles that still relate to the user's inquiry, such as:
- "how_it_works" - how [REDACTED] works to manage [REDACTED] data ...
[SKIPPED]
"""

Then I inject into this prompt the {all_titles} list of document titles to select from (~250 titles).

The answer is expected to be structured and parsed using the provided Pydantic model:

from typing import List
from pydantic import BaseModel, Field

class TitleSelection(BaseModel):
    selected_titles: List[str] = Field(..., description="Selected document titles.")
    enriched_inquiry: str = Field(..., description="Enriched user inquiry")
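
For completeness, the call itself is essentially the following (a minimal sketch of my wrapper; all_titles and user_inquiry stand in for the injected data, and the parameter values match the log record below):

    from openai import OpenAI

    client = OpenAI()

    params = {
        "model": "gpt-4o-2024-11-20",
        "messages": [
            {"role": "system", "content": PROMPT_TEMPLATE.format(all_titles=all_titles)},
            {"role": "user", "content": user_inquiry},
        ],
        "temperature": 1e-06,
        "max_completion_tokens": 16000,
        "response_format": TitleSelection,
    }

    # parse() validates the JSON against TitleSelection and raises
    # LengthFinishReasonError when finish_reason == "length".
    completion = client.beta.chat.completions.parse(**params)
    selection = completion.choices[0].message.parsed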

Here is the redacted log record:

2025-01-03 13:54:32,557 - openai_query_processor - DEBUG - Calling OpenAI API with params: {'model': 'gpt-4o-2024-11-20', 'messages': [{'role': 'system', 'content': '1. You are an AI agent selecting relevant knowledge base articles for a user inquiry about [REDACTED]...[SKIPPED] \n5. Review given list of available titles: [\'about-us\', \'how_it_works\', ... [SKIPPED]]}, {'role': 'user', 'content': 'Thank you! Everything is working correctly now with the import process?'}], 'temperature': 1e-06, 'max_completion_tokens': 16000, 'response_format': <class 'src.openai.openai_context_selector.TitleSelection'>}
2025-01-03 13:57:39,608 - api_retry_handler - ERROR - Unexpected error: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16000, prompt_tokens=3505, total_tokens=19505, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=3328))
Traceback (most recent call last):
  File "/agent/src/shared/api_retry_handler.py", line 14, in execute_func
    return func()
           ^^^^^^
  File "/agent/src/openai/openai_query_processor.py", line 35, in <lambda>
    lambda: self._request_func(
            ^^^^^^^^^^^^^^^^^^^
  File "/agent/src/openai/openai_query_processor.py", line 61, in _request_func
    response = api_method(**params)
               ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/beta/chat/completions.py", line 156, in parse
    return self._post(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1280, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 957, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1063, in _request
    return self._process_response(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1162, in _process_response
    return api_response.parse()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_response.py", line 319, in parse
    parsed = self._options.post_parser(parsed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/beta/chat/completions.py", line 150, in parser
    return _parse_chat_completion(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/lib/_parsing/_completions.py", line 72, in parse_chat_completion
    raise LengthFinishReasonError(completion=chat_completion)
openai.LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16000, prompt_tokens=3505, total_tokens=19505, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=3328))

Thank you for the detailed explanation! Let me share a few thoughts on how I would approach this based on my experience working with RAG workflows in a legal document analysis tool.


  1. Addressing the Root Problem: Data Preparation Before Embedding
    From what I see, the main issue you’re encountering comes from addressing the problem too late in the process. Instead of focusing on the model’s output parsing, you could solve the root issue by adjusting how you prepare your data before embedding it into your vector database.

By preparing the knowledge base in advance, you can improve the relevance of your search results and make your solution cheaper and more efficient.


  2. Preparing the Knowledge Base for Embedding
    Before embedding any tickets, I would recommend the following steps:
  • Semantic Chunking: Split each ticket into smaller, closed-idea chunks.
  • Chunk Summaries: Add short summaries describing the key content of each chunk.
  • Relationships Between Chunks: Identify and establish connections between chunks to create a hierarchical structure within each ticket.
  • Ticket Summaries: Provide a brief outline summarizing what problem each ticket solves.

Embedding chunks along with their summaries and hierarchical outlines will ensure that your vectors are better aligned with potential user inquiries.
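
To illustrate, embedding a chunk together with its summary and the ticket outline could look roughly like this (a minimal sketch; the helper name and the embedding model are just examples, not a specific recommendation):

    from openai import OpenAI

    client = OpenAI()

    def embed_chunk(chunk_text: str, chunk_summary: str, ticket_summary: str) -> list[float]:
        # Embed the chunk enriched with its summary and the parent ticket outline,
        # so the vector reflects the problem the chunk helps to solve.
        enriched = (
            f"Ticket outline: {ticket_summary}\n"
            f"Chunk summary: {chunk_summary}\n\n"
            f"{chunk_text}"
        )
        response = client.embeddings.create(model="text-embedding-3-small", input=enriched)
        return response.data[0].embedding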


  3. Identifying the User's Intent and Problem Description
    Once you've embedded the data, the next step is to accurately identify what the user is trying to solve. This is the most critical part.

The LLM should transform the user inquiry into a clear problem description based on the input keywords. This is key because the LLM needs to infer the user’s intent, not just match keywords directly. It should summarize the problem that the user is trying to solve and then use that as the search query.

In this workflow:

  • Input: A description of the problem provided by the user.
  • Searchable Items: Ticket descriptions that explain how to solve the identified problem.

This step ensures that the system is matching user inquiries with relevant solutions.
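
A minimal sketch of that rewriting step (the prompt wording and the model name are illustrative only):

    from openai import OpenAI

    client = OpenAI()

    def to_problem_description(user_inquiry: str) -> str:
        # Restate the raw inquiry as an explicit problem description,
        # which is then used as the search query instead of the raw keywords.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.2,
            messages=[
                {
                    "role": "system",
                    "content": "Restate the user's inquiry as a short, explicit description "
                               "of the problem they are trying to solve.",
                },
                {"role": "user", "content": user_inquiry},
            ],
        )
        return response.choices[0].message.content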


  4. Pre-selecting the Most Relevant Search Results
    Once you've retrieved potential solutions using vector search (cosine similarity), I recommend introducing a pre-selection step to reduce noise in the results.

Use an LLM to evaluate the usefulness of each search result and assign a score from 0 to 9, where 9 is highly relevant. This step will:

  • Reduce the number of items passed to the answering model.
  • Ensure only closely related items are used in the final response.

This filtering process improves the quality of the answer by eliminating irrelevant content.
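
A sketch of that scoring pass, using structured outputs the same way as your title selection (the schema and model name here are just examples):

    from openai import OpenAI
    from pydantic import BaseModel, Field

    client = OpenAI()

    class RelevanceScore(BaseModel):
        score: int = Field(..., description="Relevance from 0 (unrelated) to 9 (highly relevant)")

    def score_result(problem_description: str, ticket_text: str) -> int:
        completion = client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": "Rate how useful the ticket is for solving the problem, 0-9."},
                {"role": "user", "content": f"Problem:\n{problem_description}\n\nTicket:\n{ticket_text}"},
            ],
            response_format=RelevanceScore,
        )
        return completion.choices[0].message.parsed.score

    # Keep only items scoring, say, 7 or above for the answering prompt.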


  5. Forming the Final Answer and Grounding It
    After pre-selecting the most relevant items:
  • Use them to form the prompt for generating the final answer.
  • Add a grounding step to verify the correctness of the generated answer against the selected items and context.

Grounding is essential to:

  • Confirm that the answer is based on the provided knowledge base content.
  • Provide references to the most relevant items when necessary.
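
The grounding step can itself be one more structured call that checks the draft answer against the selected items (again, only an illustrative sketch):

    from openai import OpenAI
    from pydantic import BaseModel, Field

    client = OpenAI()

    class GroundingCheck(BaseModel):
        is_grounded: bool = Field(..., description="True if every claim is supported by the context")
        unsupported_claims: list[str] = Field(..., description="Claims not found in the context")

    def check_grounding(answer: str, context: str) -> GroundingCheck:
        completion = client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": "Check whether the answer is fully supported by the given context."},
                {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
            ],
            response_format=GroundingCheck,
        )
        return completion.choices[0].message.parsed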

  6. Final Thoughts
    This approach has worked well for me in practice. It's more cost-effective, allows parallel processing of chunks before forming the final prompt, and significantly improves the accuracy and relevance of the generated answer.

Of course, the above suggestions apply if I understood your problem correctly. I hope this helps!

(Note: AI helped me to put my thoughts in form, but I do agree with all the points mentioned above.)

Your prompt with the injected data is probably too long to work well, but I'm wondering why you wouldn't let a RAG-like approach (including the OpenAI solution) do the 'selecting'; it sounds like the perfect solution for that.
I.e., upload those 250 titles/documents and then run a RAG query, possibly including a reranker.
I think most models will not do very well on this type of query at the moment.
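
For example, embedding the ~250 titles once and ranking them per query could look roughly like this (a sketch assuming a simple in-memory index, with all_titles being the list from your prompt; a reranker could then be applied to the top hits):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts: list[str]) -> np.ndarray:
        response = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([item.embedding for item in response.data])

    # Embed the ~250 titles once and cache the matrix.
    title_vectors = embed(all_titles)

    def top_titles(query: str, k: int = 10) -> list[str]:
        q = embed([query])[0]
        scores = title_vectors @ q / (np.linalg.norm(title_vectors, axis=1) * np.linalg.norm(q))
        return [all_titles[i] for i in np.argsort(scores)[::-1][:k]]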


“Thinking” might indicate use of an ‘o1’ model, which also spends your max_completion_tokens on internal reasoning.

This indicates that the output was terminated by the max_completion_tokens setting, which we can see in your call parameters.

If you are indeed using o1, you may be terminating your own output prematurely and never getting closure of the JSON. You must set the limit much higher than the response you wish to receive, because of that internal consumption.
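
One way to tell how much of the budget went to hidden reasoning (a rough sketch using the same usage fields that appear in your log record):

    from openai.types.chat import ChatCompletion

    def reasoning_share(completion: ChatCompletion) -> tuple[int, int]:
        # Split completion_tokens into hidden reasoning vs. visible output,
        # using the usage fields shown in the error log above.
        usage = completion.usage
        details = usage.completion_tokens_details
        reasoning = (details.reasoning_tokens or 0) if details else 0
        return reasoning, usage.completion_tokens - reasoning

In your log record reasoning_tokens is 0, so the whole 16000 went to visible output, which points more to looping than to hidden reasoning.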

Or the AI has gone into a looping pattern.

If the beta parse() call fails, you can still pull the “content” field out of the response object and see what was actually received.
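
Something along these lines (a sketch reusing your client and params; the error is constructed with the completion, as your traceback shows, so the truncated text can be pulled out of it):

    from openai import LengthFinishReasonError

    try:
        completion = client.beta.chat.completions.parse(**params)
    except LengthFinishReasonError as err:
        raw = err.completion.choices[0].message.content or ""
        # The tail of the truncated JSON usually shows whether the model was looping.
        print(raw[-500:])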

A temperature of >0.7 or so, along with a bit of frequency_penalty, on non-o1 models (where this isn't otherwise predetermined for you), will tend to break up loops eventually, even if the output quality then is poor.

AI models like gpt-4o won't normally write anywhere near the maximum output length; even for a justifiably large output job, they tend to wrap up prematurely.

Either way, the output here was cut off before the JSON could be completed.