Unstable performance with PDF files in Responses API / Docs are also unclear on capabilities


I have been testing PDF files in the Responses API and got inconsistent results with gpt-4.1. Around 75% of the time it would extract the PDF's contents correctly into my StructuredOutput model, but the other 25% of the time it would ignore the file and hallucinate completely fabricated information.

In order to better understand what’s driving this, I tried sending two files to a variety of models: (a) a one-page PDF containing parsable text and a simple drawing; (b) the first file rendered to a PNG image.
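For anyone who wants to reproduce this, here is a minimal sketch of the request payloads I'm describing. The `input_file` and `input_image` content-part types come from the Responses API; the file names, prompt text, and helper function names are just illustrative:

```python
import base64


def pdf_part(pdf_bytes: bytes, filename: str = "test.pdf") -> dict:
    """Inline a PDF as a base64 data URL (Responses API `input_file` part)."""
    b64 = base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "type": "input_file",
        "filename": filename,
        "file_data": f"data:application/pdf;base64,{b64}",
    }


def image_part(png_bytes: bytes) -> dict:
    """Inline a PNG as a base64 data URL (Responses API `input_image` part)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "input_image",
        "image_url": f"data:image/png;base64,{b64}",
    }


def build_input(prompt: str, file_part: dict) -> list:
    """One user turn containing the prompt plus the attached file."""
    return [{
        "role": "user",
        "content": [{"type": "input_text", "text": prompt}, file_part],
    }]


# Usage (assumes OPENAI_API_KEY is set in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.responses.create(
#       model="gpt-4.1",
#       input=build_input(
#           "Extract the text and describe the drawing.",
#           pdf_part(open("test.pdf", "rb").read()),
#       ),
#   )
#   print(resp.output_text)
```

Swapping `pdf_part(...)` for `image_part(...)` is the only change needed to run the same prompt against the rendered PNG instead of the PDF.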

To my surprise, the results were inconsistent in a way that is not clear from OpenAI’s API docs:

  • gpt-4.1 - extracted text but not the drawing from the PDF; got both from the image.
  • gpt-4.1-mini - extracted text but not the drawing from the PDF; got both from the image.
  • gpt-4.1-nano - extracted both text and drawing from the PDF; ignored the image file.
  • gpt-4o - extracted text but not the drawing from the PDF; got both from the image.
  • gpt-4o-mini - ignored the PDF file; got both from the image.
  • o4-mini - extracted text but not the drawing from the PDF; got both from the image.
  • gemini-2.5-flash-preview - extracted both from the PDF and from the image.

Can someone from OpenAI please explain what capabilities these models should have? And if there are temporary limitations, how can we as developers find this out?

Needless to say, this feature really needs to be documented and to work predictably and reliably. Otherwise it becomes more and more tempting to migrate our LLM stack over to Google!

Thanks

Have been having similar issues. Context window length seems to be a factor, but not the sole determinant. If the PDF is large, it will often hallucinate responses; even with a small PDF, a long answer can trigger hallucination. Very unreliable. Have you tested Google's handling of larger documents?

This has continued to be an issue that is not resolved.

You can look at the input token counts on an API call.

Give yourself a “regenerate” button, and watch the token counts then.

The PDFs are not being loaded into context half the time.
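That token-count check can be sketched roughly like this. The `usage.input_tokens` field is part of the Responses API usage object; the threshold and helper name are my own guesses that you would calibrate per document:

```python
def pdf_probably_dropped(input_tokens: int, expected_min_tokens: int) -> bool:
    """Heuristic: if the call's input token count is far below what the
    prompt plus the PDF should cost, the file likely never made it into
    context and the answer should not be trusted."""
    return input_tokens < expected_min_tokens


# Usage (assumes `client` is an openai.OpenAI() and OPENAI_API_KEY is set):
#   resp = client.responses.create(model="gpt-4.1", input=...)
#   # Suppose the prompt alone costs ~200 tokens and the PDF should add
#   # well over 1,000 more; anything under that suggests the PDF was dropped.
#   if pdf_probably_dropped(resp.usage.input_tokens, expected_min_tokens=1000):
#       ...  # retry the call, or fail loudly instead of trusting the output
```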

If using file_id, the files endpoint is also messed up right now.