Partially pre-filled Pydantic schema for LLM

The goal is to fill in missing values of a dataframe using the context of a long text. It is similar to data extraction, but only partial, since some fields are already pre-filled.
My question: what is the current state of the art for this use case? The challenge is data consistency: the pre-filled values must be kept (they can also serve the LLM as examples of the right behaviour).

There are some obvious methods I have tried:

  • prompt the LLM to fill in the blanks
  • create a Pydantic class with the pre-filled values and freeze them. This did not fully work: the LLM filled the NaN values but dropped some rows. (The situation was tricky, but still…)

Let’s look at a minimal example. Say we perform an LLM data extraction.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# The long source document to extract from (placeholder).
long_text = "..."

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

class Calendar(BaseModel):
    calendar: list[CalendarEvent]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": long_text},
    ],
    response_format=Calendar,
)

# The parsed result is a Calendar instance holding all extracted events.
calendar = completion.choices[0].message.parsed

However, we have partial data, which can happen for many reasons (a previous LLM extraction (divide and conquer), ML techniques, NLP, …).
So the goal of the extraction changes: it is now about filling in the blanks (NaN values).

import math

dummy_calendar = {
    "calendar": [
        # Fully populated event
        {
            "name": "Team Sync",
            "date": "2025-05-10",
            "participants": ["Alice", "Bob"],
        },
        # Missing date
        {
            "name": "Project Kickoff",
            "date": math.nan,
            "participants": ["Charlie", "Dana"],
        },
        # Missing name
        {
            "name": math.nan,
            "date": "2025-05-15",
            "participants": ["Eli", "Frank"],
        },
        # Missing participants
        {
            "name": "Quarterly Planning",
            "date": "2025-06-01",
            "participants": math.nan,
        },
    ]
}
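
One practical detail when sending a structure like this to a model: json.dumps writes math.nan as a bare NaN token, which is not valid JSON. Below is a minimal sketch of converting the blanks to null first. The helper name nan_to_none and the shortened two-row sample are illustrative assumptions, not part of any library.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None so json.dumps emits null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, list):
        return [nan_to_none(item) for item in value]
    return value


# Two rows in the same shape as dummy_calendar, repeated so the block runs alone.
partial_calendar = {
    "calendar": [
        {"name": "Project Kickoff", "date": math.nan, "participants": ["Charlie", "Dana"]},
        {"name": "Quarterly Planning", "date": "2025-06-01", "participants": math.nan},
    ]
}

# allow_nan=False makes json.dumps raise if a NaN slipped through.
partial_json = json.dumps(nan_to_none(partial_calendar), indent=2, allow_nan=False)
print(partial_json)
```

The resulting string can be placed in the prompt next to the long text, so the model sees exactly which fields are null.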

Thank you for your help!

Have you taken a look at predicted outputs?
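
For reference, Predicted Outputs is enabled through the prediction parameter of the Chat Completions API. Here is a rough sketch of what such a request could look like for this use case. The helper build_prediction_request and the prompt wording are assumptions for illustration. As far as I understand it, the feature mainly speeds up responses that mostly repeat known text, so it does not by itself guarantee that pre-filled values are kept. Whether it can be combined with structured output is something to check in the docs.

```python
import json


def build_prediction_request(partial_json: str, long_text: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "gpt-4o-2024-08-06",
        "messages": [
            {
                "role": "system",
                "content": (
                    "Fill in the null fields of the JSON using the text. "
                    "Return the full JSON and keep every non-null value "
                    "and every row unchanged."
                ),
            },
            {"role": "user", "content": f"TEXT:\n{long_text}\n\nJSON:\n{partial_json}"},
        ],
        # Predicted Outputs: the text we expect most of the answer to match.
        "prediction": {"type": "content", "content": partial_json},
    }


partial_json = json.dumps(
    {"calendar": [{"name": "Project Kickoff", "date": None, "participants": ["Charlie", "Dana"]}]}
)
request = build_prediction_request(partial_json, long_text="...")

# Sending it needs the openai package and an API key (not executed here):
# from openai import OpenAI
# completion = OpenAI().chat.completions.create(**request)
```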

You can simply specify the Pydantic class as your preferred output. Calling Calendar.model_json_schema() gives you a schema that you can clean up a little:

from typing import Optional

def _clean_schema(schema: dict, parent_key: Optional[str] = None) -> dict:
    """Clean a JSON schema in place: drop 'title' and 'default' wherever no
    parent key is set (i.e. at the top level), then walk nested dicts and lists."""
    if isinstance(schema, dict):
        # Uncomment to also strip 'additionalProperties':
        # schema.pop("additionalProperties", None)
        if not parent_key:
            schema.pop("title", None)
            schema.pop("default", None)
        for key, value in schema.items():
            if isinstance(value, dict):
                # Nested dicts are visited with their own key as parent_key.
                _clean_schema(value, key)
            elif isinstance(value, list):
                for item in value:
                    if isinstance(item, dict):
                        _clean_schema(item, parent_key)
    return schema

You then pass that as the response_format of your request, and the response will fit your class. Consider using Field() to add a description to each field; the model also uses these to decide what to do. For participants, you might use a plain str and describe that you want the names comma-separated. Otherwise your class gets a bit more complicated than needed for a first experiment.

Use this as the API reference for adding the JSON output: https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format
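
To make the shape concrete, here is a small sketch of wrapping a schema for response_format. The helper to_response_format is a made-up name, and calendar_schema is a hand-written stand-in for a cleaned Calendar.model_json_schema() result, with field descriptions that are only examples.

```python
def to_response_format(schema: dict, name: str = "calendar") -> dict:
    """Wrap a JSON schema in the response_format structure for json_schema output."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema},
    }


# Hand-written stand-in for a cleaned Calendar.model_json_schema() result.
# The descriptions are illustrative assumptions.
calendar_schema = {
    "type": "object",
    "properties": {
        "calendar": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Title of the event"},
                    "date": {"type": "string", "description": "Date as YYYY-MM-DD"},
                    "participants": {
                        "type": "string",
                        "description": "Participant names, comma separated",
                    },
                },
                "required": ["name", "date", "participants"],
            },
        }
    },
    "required": ["calendar"],
}

response_format = to_response_format(calendar_schema)
# Then: client.chat.completions.create(..., response_format=response_format)
```

The json_schema object also accepts a strict flag. Strict mode puts extra requirements on the schema (for example additionalProperties set to false on every object), so it is left out of this sketch.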

The model should have no problem with some fields being known and others not, as long as you provide all of it.
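
If you want a safety net for the two problems mentioned in the question (changed pre-filled values and dropped rows), you can also merge the model output back into your original data instead of trusting it directly. This is a minimal sketch, assuming the model returns rows in the same order as the input. The names is_missing and merge_filled and the sample model output are hypothetical.

```python
def is_missing(value) -> bool:
    """Treat None and float NaN as blanks (NaN is the only value unequal to itself)."""
    return value is None or (isinstance(value, float) and value != value)


def merge_filled(prefilled: list[dict], filled: list[dict]) -> list[dict]:
    """Fill blanks in `prefilled` from `filled`, row by row, never overwriting known values."""
    if len(filled) != len(prefilled):
        raise ValueError(f"expected {len(prefilled)} rows, model returned {len(filled)}")
    merged = []
    for known, proposed in zip(prefilled, filled):
        row = {}
        for field, value in known.items():
            # Only blanks take the model's value; known values always win.
            row[field] = proposed.get(field) if is_missing(value) else value
        merged.append(row)
    return merged


prefilled = [
    {"name": "Project Kickoff", "date": None, "participants": ["Charlie", "Dana"]},
    {"name": None, "date": "2025-05-15", "participants": ["Eli", "Frank"]},
]
# Hypothetical model output: it fills the blanks but also alters a known name.
model_rows = [
    {"name": "Kickoff", "date": "2025-05-12", "participants": ["Charlie", "Dana"]},
    {"name": "Design Review", "date": "2025-05-15", "participants": ["Eli", "Frank"]},
]

merged = merge_filled(prefilled, model_rows)
print(merged[0])
```

Matching rows by position is the weak point here. Adding a stable id field to each row and matching on it would be more robust if the model reorders rows.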