Efficiently Retrieving Relevant Chunks per Pydantic BaseModel Field for RAG Structured Output

I am currently working on a project where I need to generate structured outputs using Pydantic’s BaseModel. Specifically, I need to retrieve relevant text chunks for each field in my model to minimize errors and ensure accurate data representation.

Context:

I have defined a Pydantic model with several fields, and I want to ensure that each field is populated with contextually relevant data extracted from a larger text corpus.

Current Approach:

Currently, I am using a method that retrieves relevant chunks based on a broad query. However, I find it challenging to associate specific chunks with individual model fields effectively.

Question:

Is there an efficient way to retrieve and associate relevant chunks for each item in a Pydantic BaseModel? Any guidance or best practices on structuring this retrieval process would be greatly appreciated.

Example:

Here’s a simplified version of my Pydantic model:

from pydantic import BaseModel, Field

class Report(BaseModel):
    title: str = Field(description="Title of the report")
    author: str = Field(description="Author of the report")
    introduction: str = Field(description="Introduction of the report")
    findings: str = Field(description="Findings of the report")
    conclusion: str = Field(description="Conclusion of the report")

I currently retrieve relevant chunks using a single query, but I want each field (e.g., introduction, findings, conclusion) to be populated with its corresponding relevant chunks.

Setting the Pydantic formalism aside, you would just embed each field, then correlate across all fields. Return the highest-scoring matches, and maybe expand those back into the full record(s) they came from.
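To make that concrete, here is a minimal sketch of per-field retrieval, reading the suggestion as "use each field's description as its own query against the chunks". Everything in it is an assumption for illustration: the `embed` function is a toy bag-of-words stand-in for whatever real embedding model you use, and `FIELD_QUERIES`, `retrieve_per_field`, and the sample chunks are hypothetical names and data, not part of any library.

```python
import math
import re
from collections import Counter

# Hypothetical per-field queries. With Pydantic v2 these could be built from the
# model itself, e.g. {n: f.description for n, f in Report.model_fields.items()}.
FIELD_QUERIES = {
    "introduction": "Introduction of the report",
    "findings": "Findings of the report",
    "conclusion": "Conclusion of the report",
}

def embed(text):
    """Toy bag-of-words vector. Assumption: swap in a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_per_field(field_queries, chunks, top_k=2):
    """Return, for each field, up to top_k chunks ranked by similarity."""
    chunk_vecs = [embed(c) for c in chunks]  # embed the corpus once
    result = {}
    for name, query in field_queries.items():
        q = embed(query)
        scored = sorted(
            ((cosine(q, v), i) for i, v in enumerate(chunk_vecs)),
            key=lambda pair: (-pair[0], pair[1]),  # best score first, stable on ties
        )
        # Drop chunks with no overlap at all so unrelated text is not attached.
        result[name] = [chunks[i] for score, i in scored[:top_k] if score > 0]
    return result

chunks = [
    "Introduction: this study examines battery aging.",
    "Our findings show capacity fades faster at high temperature.",
    "In conclusion, cooler storage extends battery life.",
]
per_field = retrieve_per_field(FIELD_QUERIES, chunks)
```

Each field ends up with its own ranked list of chunks, which you can then pass to the model as context for just that field. If you keep a mapping from chunk index to its source document, the top hits can also be expanded back into the full record.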

What have you tried? How large is the corpus? Does one Pydantic record correspond to one corpus? Most models should be able to just do this based on a good prompt. Possibly in two steps: first create a specific summary for each field, then use those summaries to create even shorter versions for your object?
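A rough sketch of the two-step idea, assuming the corpus fits in the model's context window. The `complete` argument is a hypothetical callable that sends a prompt to whatever LLM you use and returns its text reply; `two_step_fill` and the prompt wording are likewise made up for illustration.

```python
from typing import Callable, Dict

def two_step_fill(
    corpus: str,
    fields: Dict[str, str],
    complete: Callable[[str], str],
) -> Dict[str, str]:
    """Fill each field in two LLM calls: summarize first, then condense."""
    values = {}
    for name, description in fields.items():
        # Step 1: a field-specific summary of the whole corpus.
        summary = complete(
            f"Summarize everything in the text relevant to: {description}\n\n"
            f"Text:\n{corpus}"
        )
        # Step 2: shrink that summary down to the final field value.
        values[name] = complete(
            f"Condense this into the final value for '{name}' "
            f"({description}):\n\n{summary}"
        )
    return values
```

The returned dict uses the field names as keys, so it could then be validated with something like `Report(**values)`. This costs two calls per field, so it is probably only worth it when a single prompt does not give good enough results.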
