Structured output calls fail trying to parse response content

From time to time (not 100% reproducible), after a long “thinking” pause, I get an error like the one below when calling the API with structured outputs:

openai.LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16000, prompt_tokens=3501, total_tokens=19501, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=3328))

I can clearly see that the query is about 3501 tokens, of which 3328 are cached.
And I’ve allowed the completion to be up to 16000 tokens.

Most of the time, queries get an immediate and correct response, but from time to time they fail with this error.
Since I’ve wrapped my API calls with retry logic, sometimes a request goes through after a few retries, but this is not always the case.

Wondering if anyone has seen this behaviour and what the potential solutions/workarounds could be?


I think I’m facing the same issue.
I am using structured output too, but it fails on some inputs:
Error: Could not parse response content as the length limit was reached

Although my max_tokens=50 and the total_tokens=156

From the symptoms above, to me personally it looks like the temperature is too low for a rather complex task/object definition… So the model ends up repeating the same string in the output, which pushes the response past the maximum size, and the whole thing fails.

But it’s hard to tell if my gut is right without looking at the whole request data. Maybe pasting the call log would be a good place to start?

Thank you for taking a look at this.
Let me explain.

This API call is part of a hybrid context selection strategy for a RAG implementation.
The temperature is set to near zero to force the model to choose an answer from the provided list, not to hallucinate one.
Here is the redacted prompt:

    PROMPT_TEMPLATE = """
1. You are an AI agent selecting relevant knowledge base articles for a user inquiry about [REDACTED] app.  
2. [REDACTED] 
3. Use your notepad to enrich the given user inquiry by adding synonyms for key terms, leveraging your knowledge of the [REDACTED]. Maintain the original inquiry while adding these synonyms to make the inquiry more detailed and explicit.
4. Review the enriched inquiry: [ENRICHED INQUIRY HERE] 
5. Review given list of available titles: {all_titles}  
6. Use your knowledge about [REDACTED], and common user issues to determine which titles could include the necessary information to answer the enriched inquiry.  
7. Select exclusively from list above only those titles that are highly relevant to answering the enriched inquiry, strictly prioritizing titles that match key terms or concepts.
8. If no specific titles match, fall back to general titles that still relate to the user's inquiry, such as:
- "how_it_works" - how [REDACTED] works to manage [REDACTED] data ...
[SKIPPED]
"""

Then I inject into this prompt the {all_titles} list of document titles to select from (~250).

The answer is expected to be structured and parsed using this Pydantic model:

from typing import List

from pydantic import BaseModel, Field

class TitleSelection(BaseModel):
    selected_titles: List[str] = Field(..., description="Selected document titles.")
    enriched_inquiry: str = Field(..., description="Enriched user inquiry")
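
In code, the call itself looks roughly like this (a simplified sketch: my real code wraps it in retry and processing classes, and it uses the TitleSelection model and PROMPT_TEMPLATE shown above):

from openai import OpenAI

client = OpenAI()

# all_titles is the list of ~250 document titles injected into the template
system_prompt = PROMPT_TEMPLATE.format(all_titles=all_titles)

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_inquiry},   # the end-user question
    ],
    temperature=1e-06,
    max_completion_tokens=16000,
    response_format=TitleSelection,
)
result = response.choices[0].message.parsed  # TitleSelection instance on success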

Here is a redacted log record:

2025-01-03 13:54:32,557 - openai_query_processor - DEBUG - Calling OpenAI API with params: {'model': 'gpt-4o-2024-11-20', 'messages': [{'role': 'system', 'content': '1. You are an AI agent selecting relevant knowledge base articles for a user inquiry about [REDACTED]...[SKIPPED] \n5. Review given list of available titles: [\'about-us\', \'how_it_works\', ... [SKIPPED]]}, {'role': 'user', 'content': 'Thank you! Everything is working correctly now with the import process?'}], 'temperature': 1e-06, 'max_completion_tokens': 16000, 'response_format': <class 'src.openai.openai_context_selector.TitleSelection'>}
2025-01-03 13:57:39,608 - api_retry_handler - ERROR - Unexpected error: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16000, prompt_tokens=3505, total_tokens=19505, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=3328))
Traceback (most recent call last):
  File "/agent/src/shared/api_retry_handler.py", line 14, in execute_func
    return func()
           ^^^^^^
  File "/agent/src/openai/openai_query_processor.py", line 35, in <lambda>
    lambda: self._request_func(
            ^^^^^^^^^^^^^^^^^^^
  File "/agent/src/openai/openai_query_processor.py", line 61, in _request_func
    response = api_method(**params)
               ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/beta/chat/completions.py", line 156, in parse
    return self._post(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1280, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 957, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1063, in _request
    return self._process_response(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1162, in _process_response
    return api_response.parse()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_response.py", line 319, in parse
    parsed = self._options.post_parser(parsed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/beta/chat/completions.py", line 150, in parser
    return _parse_chat_completion(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/lib/_parsing/_completions.py", line 72, in parse_chat_completion
    raise LengthFinishReasonError(completion=chat_completion)
openai.LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16000, prompt_tokens=3505, total_tokens=19505, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=3328))

Thank you for the detailed explanation! Let me share a few thoughts on how I would approach this based on my experience working with RAG workflows in a legal document analysis tool.


1. Addressing the Root Problem: Data Preparation Before Embedding

From what I see, the main issue you’re encountering comes from addressing the problem too late in the process. Instead of focusing on the model’s output parsing, you could solve the root issue by adjusting how you prepare your data before embedding it into your vector database.

By preparing the knowledge base in advance, you can improve the relevance of your search results and make your solution cheaper and more efficient.


2. Preparing the Knowledge Base for Embedding

Before embedding any tickets, I would recommend the following steps:

  • Semantic Chunking: Split each ticket into smaller, closed-idea chunks.
  • Chunk Summaries: Add short summaries to describe the key content of each chunk.
  • Relationships Between Chunks: Identify and establish connections between chunks to create a hierarchical structure within each ticket.
  • Ticket Summaries: Provide a brief outline summarizing what problem each ticket solves.

Embedding chunks along with their summaries and hierarchical outlines will ensure that your vectors are better aligned with potential user inquiries.
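
For illustration only (the field names here are made up, not taken from any particular tool), a prepared chunk could be stored like this before embedding:

# Illustrative structure of one prepared chunk; adapt the names to your own pipeline.
chunk_record = {
    "ticket_id": "T-1042",
    "chunk_id": "T-1042/3",
    "summary": "Steps to retry an order import after a mapping error.",
    "outline": ["Order import", "Troubleshooting", "Failed imports"],  # place in the hierarchy
    "related_chunks": ["T-1042/2", "T-1042/4"],
    "content": "Full text of the closed-idea chunk goes here.",
}

# Embed the summary and outline together with the content, so the vector
# carries both the wording and the surrounding context.
embedding_input = (
    " > ".join(chunk_record["outline"])
    + "\n" + chunk_record["summary"]
    + "\n" + chunk_record["content"]
)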


3. Identifying the User’s Intent and Problem Description

Once you’ve embedded the data, the next step is to accurately identify what the user is trying to solve. This is the most critical part.

The LLM should transform the user inquiry into a clear problem description based on the input keywords. This is key because the LLM needs to infer the user’s intent, not just match keywords directly. It should summarize the problem that the user is trying to solve and then use that as the search query.

In this workflow:

  • Input: A description of the problem provided by the user.
  • Searchable Items: Ticket descriptions that explain how to solve the identified problem.

This step ensures that the system is matching user inquiries with relevant solutions.


4. Pre-selecting the Most Relevant Search Results

Once you’ve retrieved potential solutions using vector search (cosine similarity), I recommend introducing a pre-selection step to reduce noise in the results.

Use an LLM to evaluate the usefulness of each search result and assign a score from 0 to 9, where 9 is highly relevant. This step will:

  • Reduce the number of items passed to the answering model.
  • Ensure only closely related items are used in the final response.

This filtering process improves the quality of the answer by eliminating irrelevant content.
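
A rough sketch of what that scoring step could look like with structured outputs (the model choice, prompt, and names here are only illustrative):

from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

class RelevanceScore(BaseModel):
    score: int = Field(..., ge=0, le=9, description="0 = irrelevant, 9 = highly relevant")
    reason: str

def score_candidate(problem_description: str, candidate_text: str) -> int:
    # Ask a small model to rate how useful one retrieved item is for the problem.
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rate how useful the document is for solving the user's problem, on a 0-9 scale."},
            {"role": "user", "content": f"Problem:\n{problem_description}\n\nDocument:\n{candidate_text}"},
        ],
        response_format=RelevanceScore,
    )
    return response.choices[0].message.parsed.score

# Keep only high-scoring candidates before building the final prompt, e.g.:
# selected = [c for c in candidates if score_candidate(problem, c) >= 7]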


5. Forming the Final Answer and Grounding It

After pre-selecting the most relevant items:
  • Use them to form the prompt for generating the final answer.
  • Add a grounding step to verify the correctness of the generated answer against the selected items and context.

Grounding is essential to:

  • Confirm that the answer is based on the provided knowledge base content.
  • Provide references to the most relevant items when necessary.

6. Final Thoughts

This approach has worked well for me in practice. It’s more cost-effective, allows parallel processing of chunks before forming the final prompt, and significantly improves the accuracy and relevance of the generated answer.

Of course, the above suggestions apply if I understood your problem correctly. I hope this helps!

(Note: AI helped me put my thoughts into shape, but I do agree with all the points mentioned above.)

Your prompt with the injected data is probably too long to work well, but I’m wondering why you wouldn’t let a RAG-style retrieval (including the OpenAI solution) do the ‘selecting’? It sounds like the perfect fit for that.
I.e. upload those 250 titles/documents and then run a RAG query, possibly including a reranker.
I think most models will not do very well on this type of query at the moment.


“Thinking” might indicate use of an ‘o1’ model, which also spends your max_completion_tokens on internal reasoning.

This indicates that the output was terminated due to the max_completion_tokens setting, which we can see in your parameters.

If you are indeed using o1, you may be terminating your own output prematurely, never getting the JSON closed. You must set the limit much higher than the response you expect to receive, due to internal consumption.

Or the AI has gone into a looping pattern.

If the beta parse() fails, you can also get the response object’s “content” field and see what was received.
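
Untested sketch, but something along these lines should work if I recall the SDK correctly (the exception carries the whole completion object):

from openai import OpenAI, LengthFinishReasonError

client = OpenAI()

try:
    response = client.beta.chat.completions.parse(**params)   # your existing params dict
    parsed = response.choices[0].message.parsed
except LengthFinishReasonError as e:
    # The truncated output is still available on the exception, so you can
    # look at what was actually generated (e.g. to spot a repeating loop).
    raw_text = e.completion.choices[0].message.content
    print(raw_text[-500:])   # the tail usually shows the loop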

A temperature >0.7 or so, along with a bit of frequency_penalty, on non-o1 models (where this isn’t otherwise predetermined for you), will tend to break up loops eventually, even if the output quality then is poor.
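
For a non-o1 model, that would mean something like this (the values are only a starting point, not a recommendation):

# Sketch: nudge the sampler so it is less likely to repeat the same token run.
params = {
    "model": "gpt-4o-2024-11-20",
    "temperature": 0.8,
    "frequency_penalty": 0.3,
    "max_completion_tokens": 16000,
    # messages / response_format as before
}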

AI models like gpt-4o won’t normally write anywhere near the maximum output length. Even for a justifiably large output job, they’ll wrap up prematurely.

You also cut off the output before the JSON could be finished.

As I wrote earlier, this is just one of the steps in the hybrid context selection process.
In fact, relying solely on embedding distances will in many cases miss appropriate chunks, since users really don’t care about the language and terms used. We have users across the globe who are not native English speakers and express their issues however they can, most often missing the relevant keywords. That’s the reason I’ve built this additional step to choose titles from the list purely semantically.
There are no technical limitations on prompt length; you can use the whole context window.
I’m not looking for workarounds. I’m specifically looking for how to resolve this issue, which in my opinion is a BUG.

“Thinking” might indicate use of an ‘o1’ model, which also spends your max_completion_tokens on internal reasoning.

Nope, I used gpt-4o-2024-11-20:
2025-01-03 13:54:32,557 - openai_query_processor - DEBUG - Calling OpenAI API with params: {'model': 'gpt-4o-2024-11-20',

This indicates that the output was terminated due to the max_completion_tokens setting, which we can see in your parameters.

Nope:
It indicates that the token amount is miscalculated on the OpenAI side; this is essentially a BUG.

Yes, the AI goes into a looping pattern. I’ve noted your suggestion about playing with the temperature, and I might filter out hallucinated titles later.
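
For the filtering part, I have something as simple as this in mind (just a sketch):

# Drop any titles the model invented that are not in the provided list,
# instead of failing the whole request.
valid_titles = set(all_titles)
filtered_titles = [t for t in result.selected_titles if t in valid_titles]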

Thank you for the detailed reply.

Allow me to elaborate on this.

1. Addressing the Root Problem: Data Preparation Before Embedding

Data is prepared to the highest quality and stored in a vector DB. It contains all of the company’s KB articles from the web, parsed, formatted, and chunked.
This is not the root problem.

The root problem is how to select these chunks for inclusion in the context.

Semantic Chunking: Split each ticket into smaller, closed-idea chunks.

Please explain what you mean by “Semantic Chunking: Split each ticket”?

I use the term chunks/chunking for the context store; are you referring to the user inquiry?
For instance, a user asks: “Can I update UPC?” There is nothing to split. This is about using an alternative term for “Barcode”, and this is something where embeddings really fail.

Chunk Summaries

This won’t work in my case, since some KB articles are long technical documents with long lists of data field definitions/descriptions and usage examples, which could run into the hundreds (like API documentation). There is no point in writing a summary that names all these fields, as it would make the summary the same length as the original chunk content. That’s the reason each document/tutorial (that might be chunked) has a very clear and descriptive title, e.g. “tutorials_how_to_change_existing_shopify_orders_status_to_paid”.

Relationships Between Chunks

This is interesting. Could you please explain more about this topic? I really don’t get how these relationships could be “described” and used.

In our case, the embedded items are Markdown documents cleanly split by headings, and when chunked they retain the heading structure to clearly identify each chunk’s place in the hierarchy.
Sometimes, for consistency, certain documents are split only down to a defined heading level, in order to ensure a specific chapter is complete and sufficient to grasp the totality of a given topic.

Would this be a sufficient “relationship” definition?

Ticket Summaries

Please explain what you mean by the term “ticket”.
For me, a “ticket” is a support case with its ID, actors, and a conversation consisting of user inquiries and assistant responses.

3. Identifying the User’s Intent and Problem Description

This aspect is solved via the assistant prompt, which instructs how to identify the problem using an iterative process.
I used to call an agent to reformulate the first question in the conversation, and I clearly noticed the LLM missing key points on the first shot. One of the reasons was actually losing critical keywords and replacing them with similar terms that do not translate to the closest vector distances.

Now, with the “title_selector” strategy, I provide a complete list of document titles that essentially defines a dictionary of our terms, or a sort of mind map of elements. So here the LLM’s task is to map terms used in the user inquiry to terms listed in the set of available articles, and this is done quite well. This is exactly the place to map user intent to articles potentially containing answers to the user’s problem.

From my experience, letting the LLM transform the user inquiry very often drifts away from the original thought.
The exception is query condensing in ongoing conversations, to retain the conversation context.

4. Pre-selecting the Most Relevant Search Results

If I got it right, you are talking about context reranking here?
Probably this is something I could implement, but for now I just add all the chunks collected by the different selector strategies to the LLM query. Token space is not an issue anymore, yet I hold to certain limits to avoid drift.
On average, my context lengths vary between 10-20K tokens.
Perhaps with reranking I could reduce that by half, but why bother? The LLM is good at sorting out what is relevant and using it. I don’t have conflicting documents; they all complement each other in the “understanding” of the broader picture.

5. Forming the Final Answer and Grounding It

I strongly doubt that validating the generated response against the same context proves the correctness of the answer. Just think about getting biased assessments within the same resonance bubble.

What could alternative grounding strategies be? (Perhaps this question alone is worth a separate thread :slight_smile:)

Yet, the initial issue is still open and I believe it is quite technical.

Let’s define it as follows:

The LLM goes into loops while processing a structured output request and (probably) miscalculates the necessary output tokens, adding to the counter the tokens used for calculating/preparing the answer rather than the answer itself.

Can we resolve this one?

P.S. At the moment I’m handling these exceptions and resolving them with simple retry logic, but I’m not looking for workarounds.

I highly appreciate your time and efforts to answer my issue.

Could you show an example of what the stored object looks like and what fields are embedded and how? Also, what are the typical queries sent to your RAG engine? Then I’ll be able to tell if that’s the root problem or not.

That’s the problem you see now. What I see is that if you have that problem, I bet you have missed a lot of steps in the workflow well before getting to this point…

I have a similar situation, but I think the issue is that the completion is running hot. As I understand it, the completion cannot be more than 16,384 tokens, so in this case the question is why completion_tokens and prompt_tokens are added together in the error handling. The completion is the output, which should not be larger than 16,384 tokens, but the context window is large.

Anyway, it seems like the model is trying to output more than the maximum output allows.

Could you show an example of what the stored object looks like and what fields are embedded and how?

Apologies, but I can’t share the company’s data in such detail in public channels.
I store two types of data for the same chunk: 1. structured Markdown; 2. plaintext. Then, along with that record, I store embeddings of the plaintext, which are used for distance calculation against the user query. If/when a particular chunk gets selected, I add the MD version to the context.
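
Roughly (simplified, and the real field names differ), each stored chunk looks like this:

chunk_record = {
    "chunk_id": "kb/how_it_works/0003",
    "markdown": "## How it works\n...",   # added to the LLM context when selected
    "plaintext": "How it works ...",      # the text the embedding is computed from
    "embedding": [0.0123, -0.0456],       # truncated example vector, used for distance calculation
}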
The story is not about how my data is structured; it is about how specific chunks are selected.
Let’s just assume my data is in good shape.

Also, what are the typical queries sent to your RAG engine?

Queries are natural human conversations as part of the customer support process, coming through several support channels, e.g. Slack or Zendesk.

That’s the problem you see now. What I see is that if you have that problem, I bet you have missed a lot of steps in the workflow well before getting to this point…

I bet you don’t know what steps I have in my workflow :slight_smile:
Could you name a single one you believe I could be missing?

Also, what are the typical queries sent to your RAG engine? Then I’ll be able to tell if that’s the root problem or not.

Actually you can see what users are inquiring about in our public Slack support workspace: Slack

Hope this helps.

Sure, I understand; at least some details on the length, the formatting, and the embedding models would help.

Also, speaking of the missed steps, storing basically two variants of the same text in a chunk may be what I call “sub-optimal”. (Are they longer than a couple of paragraphs? Do they have a title/description/summary or purpose attached to them? Do you have a parent-tracking procedure to grab the content above them in the structure, etc.?) But then, it depends on the app and the tools.

Here is an example of what I’m talking about when I say pre-processing:

Knowledge base items built in 2 minutes of processing with about 12 operations from this PDF: https://betechnamibia.com/Point%20of%20Sale%20-%20Training%20Manual.pdf

Things get interesting when you dive into the 03-outline.txt and 05-knowledge-items.json files to see what the stored objects look like.

I’ll get back to this a bit later, busy with kids tonight.

Sure, I don’t. But when I had such issues in the beginning (not knowing how to select related context after the vector search), I started digging deeper and found out that if I address the stored data structures correctly, these problems disappear without any additional effort…


I would definitely introduce a user intent parser before even running any queries against my DB, just to understand better what is going on (and also potentially capture noob terminology). Here is an example of an “intent identification” model (4o-mini, no context about the subject, simple task description in the system message):

User:

  • Can I update UPC?

Assistant (internally):

{
  "unknown_terms": ["UPC"],
  "primary_intent": "User is seeking to understand whether they can update or change a Universal Product Code (UPC) associated with a product.",
  "secondary_intents": [
    "User may want to know the process or requirements for updating a UPC.",
    "User might be concerned about the implications of changing a UPC on inventory or sales tracking."
  ],
  "clarification_needed": [
    "What specific context are you referring to? Is this for an inventory system, a retail platform, or something else?",
    "Are you looking to update the UPC for a specific product, and if so, which one?"
  ]
}

Having the intent as above (assuming I have no context, as in a new thread):

  1. I would call the KB for the UPC definition and alternative names, to help the clarification-question generator ask the user the questions I need answered first.
  2. I would start building the queries for RAG.

And if I already have the context, either my model does not need clarification on the subject, or I handle it as above. In that case I just get the definitions of my unclear terms and use the list of intents to build the queries that retrieve knowledge items.

I’ll send some more later. Here is the playground preset of the intent analysis example above (a basic one, done in 5 minutes): https://platform.openai.com/playground/p/KHthRTkh1nCW08ZbRsGTR51v?mode=chat
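
And if it helps, the same intent parser expressed with structured outputs would look roughly like this (the field names mirror the JSON above; the system prompt is simplified):

from typing import List
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class UserIntent(BaseModel):
    unknown_terms: List[str]
    primary_intent: str
    secondary_intents: List[str]
    clarification_needed: List[str]

def parse_intent(user_message: str) -> UserIntent:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Identify the user's intent, list unknown or ambiguous terms, and propose clarification questions."},
            {"role": "user", "content": user_message},
        ],
        response_format=UserIntent,
    )
    return response.choices[0].message.parsed

# intent = parse_intent("Can I update UPC?")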


Thank you! I will definitely give it a try and see how to integrate it into the existing workflow.
