I am working on a data extraction task with the responses API and o3. The task is working fine with a subset of PDFs for testing, but when I upload a certain PDF it seems like the model is only seeing half of the document (tested by uploading my prompt into playground and asking it for specific information in the first half of the document). The file is only 2MB, which is larger than my other files, but seems like it should still be within my context length. I am also using the structured output feature. The weirdest part is that I made this API call 3 days ago with the same document and the extraction worked perfectly. Though I made some slight modifications to the structured output schema.
Does anyone have advice on how to debug this? I find file upload and structured output to be pretty opaque in the documentation.
Here is the API call for reference (Python SDK):
import base64
import os

from openai import OpenAI

client = OpenAI()

# instructions (prompt string), file_path, and Output (the Pydantic model
# used for structured output) are defined elsewhere in my script.
with open(file_path, "rb") as f:
    base64_string = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.parse(
    model="o3",
    instructions=instructions,
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "filename": os.path.basename(file_path),
                    "file_data": f"data:application/pdf;base64,{base64_string}",
                }
            ],
        },
    ],
    reasoning={"summary": "detailed"},
    text_format=Output,
)
Hey @freshquinoa
I can really empathize with your situation. I’ve spent a good deal of time on PDF extraction pipelines, and I wanted to share a deeper perspective that might clarify what’s happening, and offer potential solutions.
PDFs Are Not Uniform
What you’re running into is a known but rarely addressed issue: PDF files are not structured documents in the way that models like GPT expect. They are visual layout files, built more like graphic design blueprints than semantic text containers.
This means:
- Text flow may not follow a logical reading order.
- Tables can appear visually neat but contain no actual tabular structure under the hood.
- The same number of pages or file size can result in wildly different token counts, especially if the text is embedded in strange ways or the PDF includes layers, footers, metadata, or OCR artifacts.
So, two PDFs of similar size and length can behave very differently when passed to the model, especially through the responses.parse() API, which under the hood has to tokenize the whole thing.
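If you want to see how loosely file size tracks token count, you can extract the text layer yourself and measure it. Here is a minimal sketch using PyMuPDF and tiktoken; the o200k_base encoding is my assumption for o-series models, and keep in mind the API also derives image input from PDF pages, so real usage can be higher than this estimate:

```python
import fitz  # PyMuPDF
import tiktoken

def pdf_token_estimate(path: str) -> int:
    """Extract the text layer of a PDF and estimate its token count."""
    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for o-series models
    return len(enc.encode(text))

print(pdf_token_estimate("problem_document.pdf"))  # hypothetical file name
```

Running this on the problem PDF and one of the PDFs that works will often show a much bigger gap than the 2MB file size suggests.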
Structured Output + File Upload = Hidden Constraints
You’re using:
- responses.parse() with a structured output schema, and
- an input_file (base64-encoded PDF).
What likely happened:
- When you modified your structured schema, the model reprioritized what to extract.
- At the same time, the token budget (even within o3’s 200k-token context window) may have been silently exceeded, especially if the PDF had lots of embedded complexity.
- As a result, the model likely truncated the document internally and focused only on what it could safely reason over, typically the beginning of the file.
Structured output imposes stricter parsing and inference logic. If the schema matches less cleanly, or the model feels it can’t confidently meet the expected format, it may drop or skip later sections without warning.
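One way to test the truncation theory against your exact call is to look at the usage block the API returns. If input_tokens is far lower than the full document should produce, the file isn’t being ingested the way you expect. A small sketch, assuming the field names in the current openai-python SDK:

```python
# After your existing call:
# response = client.responses.parse(...)

usage = response.usage
print("input tokens:", usage.input_tokens)
print("output tokens:", usage.output_tokens)
print("total tokens:", usage.total_tokens)

# The parsed structured output, if the schema was satisfied
print(response.output_parsed)
```

If you still have logs from the run that worked three days ago, comparing input_tokens across the two runs should tell you quickly whether the document is being cut short or whether the schema change altered what gets extracted.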
Potential Solutions
Here’s a more resilient approach:
- Pre-process the PDF before sending it to the API.
- If needed, clean it up manually; a careful manual pass fixes a surprising number of issues.
- Use tools like pdfplumber, PyMuPDF, or pdfminer.six to extract clean text from the document (see the sketch after this list).
- If the document includes tables, consider extracting them separately using camelot or tabula, depending on how they’re rendered.
- Split long documents into sections (by page range, heading, or paragraph count), and feed them sequentially if needed.
- Inspect the token count after extraction to ensure you’re not nearing 100k+ tokens. Use OpenAI’s tiktoken library for this.
- Test your structured schema against smaller, clean documents first, then scale up once you confirm reliability.
- If using images or scanned text, OCR first, then follow the same text-cleaning pipeline before model input.
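To make that concrete, here is a minimal preprocessing sketch along those lines, using pdfplumber for extraction and tiktoken for budgeting. The chunk size and helper names are illustrative, not a prescription:

```python
import pdfplumber
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for o-series models

def extract_pages(path: str) -> list[str]:
    """Return the text of each page, in reading order."""
    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]

def chunk_by_tokens(pages: list[str], max_tokens: int = 20_000) -> list[str]:
    """Group consecutive pages into chunks that stay under a token budget."""
    chunks, current, current_tokens = [], [], 0
    for page_text in pages:
        n = len(enc.encode(page_text))
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(page_text)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

pages = extract_pages("problem_document.pdf")  # hypothetical file name
for i, chunk in enumerate(chunk_by_tokens(pages)):
    print(f"chunk {i}: {len(enc.encode(chunk))} tokens")
    # Feed each chunk to client.responses.parse(...) as plain input text,
    # then merge the structured outputs afterwards.
```

Sending clean, pre-chunked text also makes failures much easier to localize: if one chunk extracts badly, you know exactly which pages to look at.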
This isn’t just about fixing a bug. It’s about aligning with the underlying principle: structure must be honored before it can be interpreted. We often expect LLMs to “just figure it out,” but when it comes to semi-structured documents like PDFs, it’s far more stable to serve them clean, linear text with intentional structure.
By handling layout fragmentation, cleaning malformed content, and controlling token load before it reaches the model, you preserve integrity and ensure the model works with clear signals, not garbled noise.
Happy to help further if you want guidance setting up a lightweight preprocessing pipeline; it’s made all the difference in my own projects.
Warmly,
Luc