How to Use a Single Pydantic Model for Structured Output with Long Documents in a Chunked RAG Pipeline?

Here’s my current workflow for using Pydantic to structure the output of an LLM in a RAG pipeline:

  1. I upload a document.
  2. The document is chunked, embedded, and stored in a vector database.
  3. I define a Pydantic model with multiple fields (e.g., title, author, signature_name, etc.).
  4. I query the LLM to extract structured information by providing the context retrieved from the vector store.

This workflow works well for smaller documents that fit entirely within the LLM’s context window. However, I encounter challenges with long documents that require chunking.

For example, let’s say my Pydantic model contains these fields:

  • title: Found on the first page of the document.
  • signature_name: Found on the last page of the document.
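
Concretely, the model and the extraction call look roughly like this (I'm showing the OpenAI Python SDK's parse helper as an example; the actual client is whatever your pipeline uses):

```python
from openai import OpenAI
from pydantic import BaseModel, Field

class DocumentInfo(BaseModel):
    title: str = Field(description="Usually on the first page")
    signature_name: str = Field(description="Usually on the last page")

client = OpenAI()

def extract(context: str) -> DocumentInfo:
    # single structured-output call over the retrieved context
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # any structured-output-capable model
        messages=[
            {"role": "system", "content": "Fill the schema from the given context."},
            {"role": "user", "content": context},
        ],
        response_format=DocumentInfo,
    )
    return completion.choices[0].message.parsed
```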

Since these pieces of information live in different chunks, a single retrieval pass rarely returns both, so the LLM cannot fill both fields in one call. As a result, some fields come back missing or misinterpreted.

Current Workaround

For longer documents, I currently handle this by:

  • Performing separate queries for each field.
  • Parsing and post-processing the outputs manually.

However, this approach is cumbersome and doesn’t leverage Pydantic’s validation capabilities effectively.

My Goal

I’d like to find a way to use a single Pydantic model to extract structured information from long documents, even when the required data spans multiple chunks. One potential workaround I’ve considered is creating multiple Pydantic BaseModels, but that feels overly complicated and not ideal.

My Question

How can I adapt my workflow to handle long documents where the required fields in a Pydantic model are spread across multiple chunks? Is there a method or strategy to:

  1. Combine the outputs from multiple queries/chunks in a way that aligns with a single Pydantic model?
  2. Leverage Pydantic’s validation to streamline this process without splitting the model into multiple BaseModels?

I’d appreciate any insights, tools, or approaches to solve this problem efficiently.


I'm hitting this exact problem too. My current idea is a Pydantic model where every field is optional (defaulting to None), one extraction per chunk, and then an equality comparison across the results (roughly the sketch below). But that breaks if the string or numerical field values are inconsistent across chunks (probably more so for strings).
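
Roughly what I have in mind (names and values are placeholders):

```python
from pydantic import BaseModel

class PartialDoc(BaseModel):
    title: str | None = None
    signature_name: str | None = None

a = PartialDoc(title="Annual Report")                              # extracted from chunk 1
b = PartialDoc(title="Annual  Report", signature_name="J. Smith")  # extracted from last chunk

# equality check to detect agreement between per-chunk extractions
print(a.title == b.title)  # False: a stray double space and the comparison breaks
```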

The problem here isn't structured outputs themselves; it's that you want the AI model to respond with information that your RAG pipeline never gave to the AI.

What I would do is consider all those additional fields as document metadata.

Then any chunk of a large document returned from semantic search also carries its document's metadata (the same metadata for every part of that document), perhaps with a page or chunk number added to enrich what the AI receives.

Thus, your search might have returned chunks #3 and #6, corresponding to those page numbers, but each one already carries the metadata you need as output.
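
A rough sketch of the ingestion side; the payload shape here is just an assumption, adapt it to whatever your vector store accepts:

```python
# document-level fields captured once at ingestion time
doc_metadata = {"doc_id": "doc-42", "title": "Annual Report 2024",
                "signature_name": "J. Smith"}

# every stored chunk carries a copy of that metadata plus its own position
pages = ["ANNUAL REPORT 2024 ...", "... body text ...", "Signed, J. Smith"]
chunks = [{"text": text, "page": i + 1, **doc_metadata}
          for i, text in enumerate(pages)]

# whatever chunk the search returns, the whole-document fields ride along
hit = chunks[1]
context = f"[{hit['title']} | page {hit['page']}]\n{hit['text']}"
```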

You can also do elided document reconstruction: unify the metadata for one document, then stitch the ordered chunks that all came from that source back together, eliding the gaps between them.

You can also include metadata such as the title, subject, or authors in the text you embed, which may improve the search quality itself.

Most of all, structured outputs give the AI only one way to respond, which makes fabrication of data very possible. So it is a good idea to include a “fail” anyOf subschema the AI can use when the search doesn't supply all the fields of its normal structured output, or when the search returns nothing useful at all.
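
A minimal sketch of what that “fail” branch can look like as a Pydantic union (field names are only illustrative); the union's JSON schema comes out as an anyOf, which is what the structured-output constraint sees:

```python
from typing import Literal, Union
from pydantic import BaseModel, TypeAdapter

class DocumentInfo(BaseModel):
    kind: Literal["result"] = "result"
    title: str
    signature_name: str

class ExtractionFailure(BaseModel):
    kind: Literal["fail"] = "fail"
    missing_fields: list[str]
    reason: str

# the model always has an honest way out instead of fabricating values
Response = Union[DocumentInfo, ExtractionFailure]
print(TypeAdapter(Response).json_schema().keys())  # includes 'anyOf'
```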

Hey Owen,

Great question! This is a challenge many encounter when extracting structured information across multiple chunks, and you’re absolutely right that manually merging separate queries isn’t ideal.

Here are a few potential strategies to streamline this while keeping everything aligned within a single Pydantic model:

1. Iterative Retrieval & Merging with Progressive Updates

• Instead of making separate queries per field, consider an iterative approach where the LLM progressively updates the Pydantic model as more relevant chunks are retrieved.

• You can start with the first query capturing initial fields (e.g., title from the first chunk).

• Then, subsequent queries reference the partially filled Pydantic model and request missing fields (signature_name from the last chunk).

• This way, each LLM call builds on previous outputs, reducing redundant parsing; a sketch of this loop follows below.
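
A minimal sketch of that loop, where retrieve and ask_llm are stand-ins for your vector search and model call (both assumptions, not real APIs):

```python
from pydantic import BaseModel

class DocumentInfo(BaseModel):
    title: str
    signature_name: str

def extract_progressively(retrieve, ask_llm) -> DocumentInfo:
    """retrieve(field) -> context text; ask_llm(field, context, partial) -> value or None.
    Both are placeholders for your own search and LLM plumbing."""
    partial: dict[str, str] = {}
    for field in DocumentInfo.model_fields:        # iterate the model's own fields
        context = retrieve(field)                  # targeted search for this field
        value = ask_llm(field, context, partial)   # model sees what's filled so far
        if value is not None:
            partial[field] = value
    return DocumentInfo(**partial)                 # validate once, at the end
```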

2. Chunk Aggregation Before Extraction

• Instead of querying per field, aggregate all relevant chunks first, then pass them as a single input to the LLM.

• Example: Retrieve title (first page) + signature_name (last page) at the same time before extraction.

• This method avoids separate queries per field and gives the LLM a broader contextual view (see the sketch below).
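
A small sketch of the aggregation step; the chunk list and character cap are assumptions you'd tune to your own context window:

```python
def build_context(chunks: list[str], hits: list[int], max_chars: int = 8000) -> str:
    """One aggregated context: first page, last page, then semantic-search hits,
    deduplicated and capped so the whole thing fits one structured-output call."""
    picked, seen = [], set()
    for idx in (0, len(chunks) - 1, *hits):
        if 0 <= idx < len(chunks) and idx not in seen:
            seen.add(idx)
            picked.append(f"[chunk {idx}]\n{chunks[idx]}")
    return "\n\n".join(picked)[:max_chars]
```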

3. Custom Function to Merge Multi-Query Outputs into a Unified Model

• Store individual query results into a temporary dict.

• Merge new values into that dict as they arrive. (Pydantic’s update_forward_refs() resolves type annotations rather than data, so plain dict merging, or model_copy(update=...) in Pydantic v2, is the right tool here.)

• Validate only once, after all required fields are populated (sketched below).
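
A sketch of such a merge function; the first non-null value wins here, but you could just as easily flag conflicts instead:

```python
from pydantic import BaseModel

class DocumentInfo(BaseModel):
    title: str
    signature_name: str

def merge_results(results: list[dict]) -> DocumentInfo:
    """Merge per-chunk extraction dicts; validation runs once, post-merge."""
    merged: dict = {}
    for r in results:
        for k, v in r.items():
            if v is not None and k not in merged:
                merged[k] = v
    return DocumentInfo.model_validate(merged)

info = merge_results([
    {"title": "Annual Report", "signature_name": None},  # from the first chunk
    {"title": None, "signature_name": "J. Smith"},       # from the last chunk
])
```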

4. Chunk-Aware Querying

• Instead of blind chunking, tag your embeddings with metadata (e.g., page_number, section, importance score).

• Then use a context-aware retriever that prioritizes pulling the chunks most likely to contain high-value fields before making LLM calls; a toy version follows below.
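
A toy, in-memory version of that routing idea (the routing table and chunk shape are assumptions, not any particular retriever's API):

```python
def retrieve_for(field: str, chunks: list[dict]) -> list[dict]:
    """Metadata-filtered retrieval: route whole-document fields to the
    pages/sections where they usually live before any LLM call."""
    routing = {"title": {"page": 1}, "signature_name": {"section": "closing"}}
    wanted = routing.get(field, {})
    return [c for c in chunks if all(c.get(k) == v for k, v in wanted.items())]

chunks = [
    {"text": "ANNUAL REPORT 2024", "page": 1, "section": "front"},
    {"text": "... body ...", "page": 7, "section": "body"},
    {"text": "Signed, J. Smith", "page": 42, "section": "closing"},
]
print(retrieve_for("signature_name", chunks))  # -> the closing-page chunk only
```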

Final Thought

Your instinct is right—splitting into multiple BaseModels feels unnecessary. The key is designing a retrieval & update strategy that allows the Pydantic model to progressively fill itself in a structured way.

Let me know if you want a fuller implementation of one of these methods; I’d be happy to dig deeper!