After switching to structured outputs with a Pydantic model, I've observed a decline in data extraction accuracy. Previously, the same prompt effectively extracted attributes from documents without structured outputs. With structured outputs enabled, a few attribute extractions now fail, which results in more false negatives during evaluation. Notably, after I reduced the schema hierarchy, the number of false positives decreased. I'd like to understand how structured outputs influence extraction when the prompt itself is unchanged, and how the prompt and the structured output schema (the Pydantic model) interact. Insight into this interaction would help me adjust the prompt accordingly.
Model: gpt-4o
Model version: 2024-08-06
API version: 2024-10-01-preview
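
To make the question concrete, here is a minimal sketch of what "reducing the schema hierarchy" could look like. The class and field names are hypothetical placeholders, not the actual schema, and the sketch assumes Pydantic v2 (`model_json_schema()`). It only builds and compares the JSON schemas generated from a nested and a flattened model; it does not call the API.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Party(BaseModel):
    # Hypothetical nested object: each nested model adds a level to the schema.
    name: Optional[str] = Field(
        None, description="Full legal name of the party, as written in the document."
    )


class NestedExtraction(BaseModel):
    # Hypothetical nested variant: the name lives one level down, under `party`.
    party: Optional[Party] = Field(None, description="The contracting party.")
    effective_date: Optional[str] = Field(
        None, description="Effective date in YYYY-MM-DD format, or null if absent."
    )


class FlatExtraction(BaseModel):
    # Hypothetical flattened variant: same information, no nested model.
    party_name: Optional[str] = Field(
        None, description="Full legal name of the party, as written in the document."
    )
    effective_date: Optional[str] = Field(
        None, description="Effective date in YYYY-MM-DD format, or null if absent."
    )


# JSON schemas generated from the two models.
nested_schema = NestedExtraction.model_json_schema()
flat_schema = FlatExtraction.model_json_schema()

# The nested model produces a "$defs" section with a referenced sub-schema;
# the flat model keeps every attribute at the top level.
print(sorted(nested_schema.get("$defs", {})))
print(sorted(flat_schema["properties"]))
```

The field names and `description` strings end up in the generated schema, so they are part of what accompanies the prompt. Comparing the two schemas side by side may help show where the prompt's wording and the schema's wording for the same attribute diverge.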
Any assistance is greatly appreciated.