Increased False Negatives in Attribute Extraction Using Structured Outputs

After implementing structured outputs with a Pydantic model, I've observed a decline in data extraction accuracy. Previously, my prompt extracted attributes from documents reliably without structured outputs. With structured outputs, however, a few attribute extractions now fail, resulting in more false negatives during evaluation. Notably, after reducing the schema hierarchy, the number of false negatives decreased. I'd like to understand how structured outputs influence extraction, given that the prompt itself is unchanged, and how the prompt and the structured output schema (the Pydantic model) interact. Insights into this interaction would help me adjust the prompt accordingly.
Model: gpt-4o
Model version: 2024-08-06
API version: 2024-10-01-preview
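
For context, the extraction call is set up roughly like this (a simplified sketch; the attribute names and prompt text are placeholders rather than my real schema):

```python
import os
from typing import Optional

from openai import AzureOpenAI
from pydantic import BaseModel


# Placeholder schema - the real one has more attributes and (previously) more nesting
class DocumentAttributes(BaseModel):
    title: Optional[str]
    author: Optional[str]
    publication_date: Optional[str]


client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-01-preview",
)

document_text = "..."  # the document being processed

completion = client.beta.chat.completions.parse(
    model="gpt-4o",  # deployment of gpt-4o 2024-08-06
    messages=[
        {
            "role": "system",
            "content": "Extract the attributes from the document. "
                       "Use null for any attribute that is not present.",
        },
        {"role": "user", "content": document_text},
    ],
    response_format=DocumentAttributes,  # structured outputs enforced via the Pydantic model
)

attributes = completion.choices[0].message.parsed
```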

Any assistance is greatly appreciated.

I would consider that a sign that your schema is a bit too complex for the task… Sharing the prompt and the schema would help a lot in pinpointing the issue.


The Pydantic model should be used simply to provide the response schema/types - you shouldn't put descriptions into the Pydantic model itself. It's mostly there to define a harness for what the response types should be, so that you can also validate them properly.

The descriptions, rules, explanations, etc. should go into the system prompt instead.
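
As a rough sketch (the attribute names here are made up), that means keeping the model down to bare types and putting the field-level guidance in the system prompt rather than in `Field(description=...)`:

```python
from typing import Optional

from pydantic import BaseModel


# Types only - no description metadata on the model itself
class InvoiceAttributes(BaseModel):
    invoice_number: Optional[str]
    total_amount: Optional[float]
    currency: Optional[str]


# The extraction rules live in the prompt, where the model reads them as instructions
system_prompt = """Extract the following attributes from the invoice text:
- invoice_number: the supplier's invoice identifier, exactly as written
- total_amount: the grand total as a plain number, without currency symbols
- currency: the ISO 4217 code (e.g. EUR, USD)
Return null for any attribute that is not explicitly stated in the document."""
```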

As Serge mentioned, try to keep your Pydantic model as flat as possible, without complex hierarchies or nesting.
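
For example (hypothetical fields), here is a nested contract schema next to a flattened version carrying the same information:

```python
from typing import Optional

from pydantic import BaseModel


# Nested: every extra level is another structure the model has to fill in consistently
class Address(BaseModel):
    street: Optional[str]
    city: Optional[str]


class Party(BaseModel):
    name: Optional[str]
    address: Optional[Address]


class ContractNested(BaseModel):
    buyer: Optional[Party]
    seller: Optional[Party]


# Flattened: the same information as a single level of explicitly named fields
class ContractFlat(BaseModel):
    buyer_name: Optional[str]
    buyer_city: Optional[str]
    seller_name: Optional[str]
    seller_city: Optional[str]
```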

Have you tried the latest November models to see if they behave the same?
