How to Use a Single Pydantic Model for Structured Output with Long Documents in a Chunked RAG Pipeline?

Here’s my current workflow for using Pydantic to structure the output of an LLM in a RAG pipeline:

  1. I upload a document.
  2. The document is chunked, embedded, and stored in a vector database.
  3. I define a Pydantic model with multiple fields (e.g., title, author, signature_name, etc.).
  4. I query the LLM to extract structured information by providing the context retrieved from the vector store.

This workflow works well for smaller documents that fit entirely within the LLM’s context window. However, I encounter challenges with long documents that require chunking.
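For reference, the chunking in step 2 is nothing special. Here is a simplified character-based version (my real pipeline uses a token-aware splitter; the function name and parameters here are just illustrative):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character-based chunks."""
    # Walk the text in steps of (chunk_size - overlap) so that
    # consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk is then embedded and stored individually, which is exactly why a field from page 1 and a field from the last page end up in different vector-store entries.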

For example, let’s say my Pydantic model contains these fields:

  • title: Found on the first page of the document.
  • signature_name: Found on the last page of the document.
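A minimal version of that model (field names match the example above; `Optional` defaults because any given chunk may contain only some of the fields):

```python
from typing import Optional
from pydantic import BaseModel


class DocumentInfo(BaseModel):
    """Illustrative schema for the extraction target."""
    title: Optional[str] = None           # expected on the first page
    author: Optional[str] = None
    signature_name: Optional[str] = None  # expected on the last page
```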

Since these pieces of information sit in different chunks, a single retrieval pass rarely surfaces the context needed for every field, so the LLM cannot extract both in one query. As a result, some fields are missed or misinterpreted.

Current Workaround

For longer documents, I currently handle this by:

  • Performing separate queries for each field.
  • Parsing and post-processing the outputs manually.

However, this approach is cumbersome, multiplies LLM calls, and doesn’t leverage Pydantic’s validation capabilities effectively.
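To make the workaround concrete, this is roughly what the per-field querying looks like. `query_llm_for_field` is a hypothetical stand-in for my actual retrieval + LLM call; it returns canned values here just so the sketch runs:

```python
from typing import Optional


def query_llm_for_field(field_name: str, context: str) -> Optional[str]:
    # Hypothetical stub: the real version retrieves the chunks most
    # relevant to `field_name` from the vector store and prompts the
    # LLM for that single field. Canned values stand in for LLM output.
    canned = {"title": "Annual Report 2023", "signature_name": "Jane Smith"}
    return canned.get(field_name)


# One retrieval + LLM round trip per field, then manual assembly --
# none of this goes through a single Pydantic model.
fields = ["title", "author", "signature_name"]
raw_results = {name: query_llm_for_field(name, "<retrieved context>") for name in fields}
```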

My Goal

I’d like to find a way to use a single Pydantic model to extract structured information from long documents, even when the required fields span multiple chunks. One workaround I’ve considered is splitting the schema into several smaller Pydantic BaseModels (one per document region) and extracting each separately, but that feels overly complicated and not ideal.
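For context, the kind of merge I’m picturing (but haven’t found a clean, single-model way to achieve) would combine partial extractions from different chunks and run one validation pass at the end. All names and values here are illustrative:

```python
from typing import Optional
from pydantic import BaseModel


class DocumentInfo(BaseModel):
    title: Optional[str] = None
    author: Optional[str] = None
    signature_name: Optional[str] = None


# Partial extractions, e.g. one dict per chunk (illustrative values).
partials = [
    {"title": "Annual Report 2023", "signature_name": None},
    {"title": None, "signature_name": "Jane Smith"},
]

# Merge non-None values, with later chunks filling gaps left by
# earlier ones, then validate the combined result exactly once.
merged: dict = {}
for partial in partials:
    merged.update({k: v for k, v in partial.items() if v is not None})

info = DocumentInfo(**merged)
```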

My Question

How can I adapt my workflow to handle long documents where the required fields in a Pydantic model are spread across multiple chunks? Is there a method or strategy to:

  1. Combine the outputs from multiple queries/chunks in a way that aligns with a single Pydantic model?
  2. Leverage Pydantic’s validation to streamline this process without splitting the model into multiple BaseModels?

I’d appreciate any insights, tools, or approaches to solve this problem efficiently.