I have been testing PDF file inputs in the Responses API and getting inconsistent results with gpt-4.1. Around 75% of the time it would extract the PDF's contents perfectly into my StructuredOutput model, but the other 25% of the time it would ignore the file and hallucinate completely fabricated information.
To better understand what's driving this, I tried sending two files to a variety of models: (a) a one-page PDF containing parsable text and a simple drawing; (b) the same page rendered to a PNG image.
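For reference, here is roughly what my test harness does (a simplified sketch, not my exact code; the file names and the StructuredOutput fields are placeholders), in case the problem is on my end:

```python
import base64
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

# Placeholder schema standing in for my actual StructuredOutput model
class StructuredOutput(BaseModel):
    text_content: str
    drawing_description: str

def encode(path: str) -> str:
    """Base64-encode a local file for inline file_data / image_url parts."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

prompt = {"type": "input_text", "text": "Extract the text and describe the drawing."}

# (a) the one-page PDF, passed inline as an input_file content part
pdf_response = client.responses.parse(
    model="gpt-4.1",
    input=[{
        "role": "user",
        "content": [
            prompt,
            {
                "type": "input_file",
                "filename": "test_page.pdf",
                "file_data": f"data:application/pdf;base64,{encode('test_page.pdf')}",
            },
        ],
    }],
    text_format=StructuredOutput,
)

# (b) the same page rendered to PNG, passed as an input_image content part
png_response = client.responses.parse(
    model="gpt-4.1",
    input=[{
        "role": "user",
        "content": [
            prompt,
            {
                "type": "input_image",
                "image_url": f"data:image/png;base64,{encode('test_page.png')}",
            },
        ],
    }],
    text_format=StructuredOutput,
)

print(pdf_response.output_parsed)
print(png_response.output_parsed)
```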
To my surprise, the results were inconsistent in a way that is not clear from OpenAI’s API docs:
- gpt-4.1 - extracted the text but not the drawing from the PDF; got both from the image.
- gpt-4.1-mini - extracted the text but not the drawing from the PDF; got both from the image.
- gpt-4.1-nano - extracted the text and the drawing from the PDF; ignored the image file entirely.
- gpt-4o - extracted the text but not the drawing from the PDF; got both from the image.
- gpt-4o-mini - ignored the PDF file entirely; got both from the image.
- o4-mini - extracted the text but not the drawing from the PDF; got both from the image.
- gemini-2.5-flash-preview - extracted both the text and the drawing from both the PDF and the image.
Can someone from OpenAI please explain what file-handling capabilities each of these models is supposed to have? And if there are temporary limitations, how can we as developers find out about them?
Needless to say, this feature really needs to be documented and to work predictably and reliably. Otherwise it's more and more tempting to migrate our LLM stack over to Google!
Thanks