The goal is to fill in the missing values of a DataFrame using the context of a long text. It is similar to data extraction, but only partial, since some fields are already pre-filled.
My question: what is the current state of the art for this use case? The challenge here is data consistency: keeping the pre-filled values, which can also be useful to the LLM as examples of the right behaviour.
There are some obvious methods I have tried:
- prompt the LLM to fill in the blanks
- create a Pydantic class with the pre-filled values and freeze them. This did not fully work: the LLM filled in the NaN values but dropped some rows. (The situation was tricky, but still…)
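One way to at least detect the dropped-row problem described above is to compare the LLM output against the original rows before accepting it. The following is only a minimal sketch using plain dictionaries: the helper names `is_missing` and `check_consistency` are hypothetical, the sample `llm_output` is made up for illustration, and it assumes the LLM returns rows in the same order as the input.

```python
import math


def is_missing(value):
    """True for None or a float NaN (the blank marker used in this post)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_consistency(original, filled):
    """Return a list of problems; an empty list means the output is consistent."""
    problems = []
    if len(filled) != len(original):
        problems.append(f"row count changed: {len(original)} -> {len(filled)}")
    # Rows are compared by position (assumption: the LLM keeps the row order).
    for i, (before, after) in enumerate(zip(original, filled)):
        for field, value in before.items():
            if is_missing(value):
                if is_missing(after.get(field)):
                    problems.append(f"row {i}: '{field}' still missing")
            elif after.get(field) != value:
                problems.append(f"row {i}: pre-filled '{field}' was changed")
    return problems


original = [
    {"name": "Team Sync", "date": "2025-05-10"},
    {"name": "Project Kickoff", "date": math.nan},
]
# Hypothetical LLM output that dropped the second row.
llm_output = [{"name": "Team Sync", "date": "2025-05-10"}]
print(check_consistency(original, llm_output))
```

A non-empty result could be used to retry the request or to fall back to the original rows, rather than silently accepting a shorter table.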
Let’s now look at a minimal example. Let’s say we perform an LLM data extraction.
from typing import List

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Placeholder: the long source text to extract events from.
long_text = "..."


class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: List[str]


class Calendar(BaseModel):
    calendar: List[CalendarEvent]


completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": long_text},
    ],
    response_format=Calendar,
)

# The parsed result is a Calendar instance holding all extracted events.
calendar = completion.choices[0].message.parsed
However, we often have partial data, which can happen for many reasons (a previous LLM extraction (divide and conquer), ML techniques, NLP, …).
So the goal of the extraction changes: now it is about filling in the blanks (NaN values).
import math

dummy_calendar = {
    "calendar": [
        # Fully populated event
        {
            "name": "Team Sync",
            "date": "2025-05-10",
            "participants": ["Alice", "Bob"],
        },
        # Missing date
        {
            "name": "Project Kickoff",
            "date": math.nan,
            "participants": ["Charlie", "Dana"],
        },
        # Missing name
        {
            "name": math.nan,
            "date": "2025-05-15",
            "participants": ["Eli", "Frank"],
        },
        # Missing participants
        {
            "name": "Quarterly Planning",
            "date": "2025-06-01",
            "participants": math.nan,
        },
    ]
}
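To make the consistency requirement concrete, here is one possible direction, sketched under assumptions rather than offered as an established method: instead of asking the LLM to return the whole table, list the blank cells, ask only for those, and merge the answers back in code so that pre-filled values and the row count cannot change. The helpers `missing_cells` and `apply_patch`, the patch format `{(row_index, field): value}`, and the sample answer are all hypothetical; the LLM call itself is left out.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_cells(rows):
    """List the (row_index, field) pairs whose value is blank."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, value in row.items()
        if is_missing(value)
    ]


def apply_patch(rows, patch):
    """Return a copy of rows, filling only blank cells from the patch."""
    filled = [dict(row) for row in rows]
    for (i, field), value in patch.items():
        in_range = 0 <= i < len(filled)
        # Entries that target a pre-filled or unknown cell are ignored.
        if in_range and field in filled[i] and is_missing(filled[i][field]):
            filled[i][field] = value
    return filled


rows = [
    {"name": "Team Sync", "date": "2025-05-10", "participants": ["Alice", "Bob"]},
    {"name": "Project Kickoff", "date": math.nan, "participants": ["Charlie", "Dana"]},
]
print(missing_cells(rows))  # the cells to ask the LLM about

# Hypothetical LLM answer; the second entry targets a pre-filled cell.
patch = {(1, "date"): "2025-05-12", (0, "name"): "Renamed"}
result = apply_patch(rows, patch)
print(result)
```

With this shape, the full rows can still be sent to the LLM as context and as examples of correct values, while the structured response only needs to describe the blank cells.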
Thank you for your help!