Structured output Precision / Accuracy: Pydantic vs a Schema

I’m working on a project that reads semi-structured data from emails and PDFs and turns it into rigidly structured data. Ideally there are “accurate” and consistent answers (I’m aware that LLMs are not deterministic). I’ve spent a lot of time on prompts trying to improve the output; I pass a schema of the JSON I want back. That seems to work pretty well, but can sometimes be inconsistent.

Has anyone ever moved from a schema over to Pydantic definitions for structured output and seen it work better? This is a heavy lift for me at this point, but if I can squeeze out a few percentage points in consistency and accuracy, I’d give it a shot.


Hi @john47 ,
I was actually about to post a similar question, namely whether there’s a fundamental difference between using client.chat.completions.create and client.beta.chat.completions.parse (with a Pydantic model).

Below is a very simple example:

  • using pydantic
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class CalendarEvent(BaseModel):
    name: str = Field(description="the name")
    date: str = Field(description="the date")
    participants: list[str]

res = client.beta.chat.completions.parse(
    model=config["OPEN_AI_MODEL_4o_MINI"],
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)
res.choices[0].message.parsed.model_dump()
# {'name': 'Science Fair', 'date': 'Friday', 'participants': ['Alice', 'Bob']}
  • using structured output
res2 = client.chat.completions.create(
    model=config["OPEN_AI_MODEL_4o_MINI"],
    messages=[
        {"role": "system", "content": """
         - Extract the event information.
         - The structure should be returned in JSON containing:
                - the name of the event, the key name should be `event`
                - the date of the event, the key name should be `date`
                - an array of all participants, the key name should be `participants`
         """},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format={ "type": "json_object" },
)
print(res2.choices[0].message.content)
#{
#    "event": "science fair",
#    "date": "Friday",
#    "participants": ["Alice", "Bob"]
#}

In this example both outputs are the same, but I prefer Pydantic because it seems to offer more control over the desired output, with less prompting.
It also lets you define default values, which can be useful in case the data extraction fails.
A final point is that you can plug in all the Pydantic validation tools directly (a small sketch follows at the end of this post).

On larger examples I did not see a difference in timing between the two methods.
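
To illustrate the defaults and validation points, a minimal sketch (the class name, the default value, and the validator here are purely illustrative): even with the plain JSON-mode call you can still run the returned string through a Pydantic model afterwards to pick up defaults and validation on the client side.

from pydantic import BaseModel, Field, field_validator

class CalendarEventWithDefaults(BaseModel):
    name: str = Field(description="the name")
    date: str = "unknown"                       # fallback if extraction misses the date
    participants: list[str] = Field(default_factory=list)

    @field_validator("participants")
    @classmethod
    def drop_empty_names(cls, v: list[str]) -> list[str]:
        # remove empty strings the model might emit
        return [p for p in v if p.strip()]

# e.g. validating a raw JSON string returned by the JSON-mode call
raw = '{"name": "Science Fair", "participants": ["Alice", "Bob", ""]}'
event = CalendarEventWithDefaults.model_validate_json(raw)
# event.date == "unknown", and the empty participant is dropped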

You are not exactly providing what was asked. You are using “JSON mode”, with json_object as the response_format, and just instructing the AI in the prompt.

What is being asked about is passing a schema as a JSON object versus passing a Pydantic BaseModel into the OpenAI Python SDK as the response_format parameter, and having the openai library convert the Pydantic hierarchy into a JSON schema to be sent.
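
A minimal sketch of that equivalence, assuming the CalendarEvent model, client, and config from the post above (the json_schema name "calendar_event" is made up here): passing the model to parse() sends roughly the same request as building the json_schema response_format yourself from the model's generated schema.

# roughly what the SDK derives from the Pydantic model; the strict
# conversion also marks every property as required
schema = CalendarEvent.model_json_schema()
schema["additionalProperties"] = False  # strict mode requires this

res = client.chat.completions.create(
    model=config["OPEN_AI_MODEL_4o_MINI"],
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "calendar_event",
            "schema": schema,
            "strict": True,
        },
    },
)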

Pointers:

  • If you observe what is being sent as the schema and it is the same as what you would have written manually, then the results will be the same;

  • the Python parse() method only adds the parsed object to the response internally and validates the AI’s return against the Pydantic model if one was used;

  • on your own, you can use strict: false and treat response_format simply as a way of sending a schema without enforcing the structure of what the AI can write, allowing optional keys and errors in AI generation (see the sketch after this list).
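
To make the last two pointers concrete, a sketch assuming the CalendarEvent model, client, and config from the earlier post: with strict set to false the schema is only guidance, and validating the return yourself is roughly what parse() does for you when you pass a Pydantic model.

from pydantic import ValidationError

res = client.chat.completions.create(
    model=config["OPEN_AI_MODEL_4o_MINI"],
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "calendar_event",
            "schema": CalendarEvent.model_json_schema(),
            "strict": False,  # schema is sent as guidance, structure is not enforced
        },
    },
)

# roughly what parse() does afterwards: validate the raw text against the model
try:
    event = CalendarEvent.model_validate_json(res.choices[0].message.content)
except ValidationError:
    event = None  # the model strayed from the schema; handle it explicitly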


@jonathan.bouchet isn’t too far off here. These are all strong arguments for using the Pydantic approach. The extraction we’re looking for has to be repeatable, so things like enforcing a structure and falling back to a default value are helpful.
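
For what it’s worth, a minimal sketch of those two points, assuming the client and config from the earlier posts (the ExtractedEvent model and its nullable date field are just illustrations): a nullable field lets the model return null instead of guessing when the source does not contain the value, and the parse() response exposes parsed and refusal so failures stay explicit.

from typing import Optional
from pydantic import BaseModel, Field

class ExtractedEvent(BaseModel):
    name: str = Field(description="the event name")
    date: Optional[str] = Field(description="the date, or null if not stated in the source")
    participants: list[str]

res = client.beta.chat.completions.parse(
    model=config["OPEN_AI_MODEL_4o_MINI"],
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=ExtractedEvent,
)

msg = res.choices[0].message
if msg.refusal:
    event = None        # the model declined; handle explicitly rather than guessing
else:
    event = msg.parsed  # a validated ExtractedEvent instance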