I am working on a data extraction task with the responses API and o3. The task is working fine with a subset of PDFs for testing, but when I upload a certain PDF it seems like the model is only seeing half of the document (tested by uploading my prompt into playground and asking it for specific information in the first half of the document). The file is only 2MB, which is larger than my other files, but seems like it should still be within my context length. I am also using the structured output feature. The weirdest part is that I made this API call 3 days ago with the same document and the extraction worked perfectly. Though I made some slight modifications to the structured output schema.
Does anyone have advice on how to debug this? I find file upload and structured output to be pretty opaque in the documentation.
Here is the API call for reference (Python SDK):
import base64
import os

from openai import OpenAI

client = OpenAI()

# instructions (prompt string), file_path, and Output (the Pydantic model
# used for structured output) are defined elsewhere in my script.
with open(file_path, "rb") as f:
    base64_string = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.parse(
    model="o3",
    instructions=instructions,
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "filename": os.path.basename(file_path),
                    "file_data": f"data:application/pdf;base64,{base64_string}",
                }
            ],
        },
    ],
    reasoning={"summary": "detailed"},
    text_format=Output,
)
Hey @freshquinoa
I can really empathize with your situation. I’ve spent a good deal of time on PDF extraction pipelines, and I wanted to share a deeper perspective that might clarify what’s happening, and offer potential solutions.
PDFs Are Not Uniform
What you’re running into is a known but rarely addressed issue: PDF files are not structured documents in the way that models like GPT expect. They are visual layout files, built more like graphic design blueprints than semantic text containers.
This means:
- Text flow may not follow a logical reading order.
- Tables can appear visually neat but contain no actual tabular structure under the hood.
- The same number of pages or file size can result in wildly different token counts, especially if the text is embedded in strange ways or the PDF includes layers, footers, metadata, or OCR artifacts.
So, two PDFs of similar size and length can behave very differently when passed to the model, especially through the responses.parse() API, which under the hood has to tokenize the whole thing.
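If you want to see how loosely file size tracks token count, you can extract the text layer yourself and measure it. Here is a minimal sketch using PyMuPDF and tiktoken; the o200k_base encoding is my assumption for o-series models, and keep in mind the API also derives image input from PDF pages, so real usage can be higher than this estimate:

```python
import fitz  # PyMuPDF
import tiktoken

def pdf_token_estimate(path: str) -> int:
    """Extract the text layer of a PDF and estimate its token count."""
    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for o-series models
    return len(enc.encode(text))

print(pdf_token_estimate("problem_document.pdf"))  # hypothetical file name
```

Running this on the problem PDF and one of the PDFs that works will often show a much bigger gap than the 2MB file size suggests.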
Structured Output + File Upload = Hidden Constraints
You’re using:
- responses.parse() with a structured output schema, and
- an input_file (base64-encoded PDF).
What likely happened:
- When you modified your structured schema, the model reprioritized what to extract.
- At the same time, the token budget (even within o3’s 200k-token context window) may have been silently exceeded, especially if the PDF had lots of embedded complexity.
- As a result, the model likely truncated the document internally and focused only on what it could safely reason over, typically the beginning of the file.
Structured output imposes stricter parsing and inference logic. If the schema matches less cleanly, or the model feels it can’t confidently meet the expected format, it may drop or skip later sections without warning.
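One way to test the truncation theory against your exact call is to look at the usage block the API returns. If input_tokens is far lower than the full document should produce, the file isn’t being ingested the way you expect. A small sketch, assuming the field names in the current openai-python SDK:

```python
# After your existing call:
# response = client.responses.parse(...)

usage = response.usage
print("input tokens:", usage.input_tokens)
print("output tokens:", usage.output_tokens)
print("total tokens:", usage.total_tokens)

# The parsed structured output, if the schema was satisfied
print(response.output_parsed)
```

If you still have logs from the run that worked three days ago, comparing input_tokens across the two runs should tell you quickly whether the document is being cut short or whether the schema change altered what gets extracted.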
Potential Solutions
Here’s a more resilient approach:
- Pre-process the PDF before sending it to the API.
- If needed, clean it up manually; a careful manual pass fixes a surprising number of issues.
- Use tools like pdfplumber, PyMuPDF, or pdfminer.six to extract clean text from the document (see the sketch after this list).
- If the document includes tables, consider extracting them separately using camelot or tabula, depending on how they’re rendered.
- Split long documents into sections (by page range, heading, or paragraph count), and feed them sequentially if needed.
- Inspect the token count after extraction to ensure you’re not nearing 100k+ tokens. Use OpenAI’s tiktoken library for this.
- Test your structured schema against smaller, clean documents first, then scale up once you confirm reliability.
- If using images or scanned text, OCR first, then follow the same text-cleaning pipeline before model input.
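To make that concrete, here is a minimal preprocessing sketch along those lines, using pdfplumber for extraction and tiktoken for budgeting. The chunk size and helper names are illustrative, not a prescription:

```python
import pdfplumber
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for o-series models

def extract_pages(path: str) -> list[str]:
    """Return the text of each page, in reading order."""
    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]

def chunk_by_tokens(pages: list[str], max_tokens: int = 20_000) -> list[str]:
    """Group consecutive pages into chunks that stay under a token budget."""
    chunks, current, current_tokens = [], [], 0
    for page_text in pages:
        n = len(enc.encode(page_text))
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(page_text)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

pages = extract_pages("problem_document.pdf")  # hypothetical file name
for i, chunk in enumerate(chunk_by_tokens(pages)):
    print(f"chunk {i}: {len(enc.encode(chunk))} tokens")
    # Feed each chunk to client.responses.parse(...) as plain input text,
    # then merge the structured outputs afterwards.
```

Sending clean, pre-chunked text also makes failures much easier to localize: if one chunk extracts badly, you know exactly which pages to look at.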
This isn’t just about fixing a bug. It’s about aligning with the underlying principle: structure must be honored before it can be interpreted. We often expect LLMs to “just figure it out,” but when it comes to semi-structured documents like PDFs, it’s far more stable to serve them clean, linear text with intentional structure.
By handling layout fragmentation, cleaning malformed content, and controlling token load before it reaches the model, you preserve integrity and ensure the model works with clear signals, not garbled noise.
Happy to help further if you want guidance setting up a lightweight preprocessing pipeline; it’s made all the difference in my own projects.
Warmly,
Luc