Best practices for PDF parsing with Assistants API and file_search tool

blovario · January 27, 2025, 9:50pm

Hi everyone,

I am currently building an app that has to send a PDF file (sometimes scanned, sometimes not) to the OpenAI API and then ask something like:
“Analyze this PDF file and return a JSON with the following structure:
{
‘user’: string,
‘issue’: string,
‘date’: string // format YYYY-MM-DD
}”

My current questions are:

1. What’s the best approach to implement this with the Assistants API?
- Should I use file_search or any other tool?
- Any specific setup for PDF handling?

2. Technical implementation:
- Using TypeScript with OpenAI SDK v4.80.0
- Need to handle the file upload correctly
- Want to ensure proper error handling

Has anyone successfully implemented something similar? Any best practices or pitfalls to avoid?

Code snippet of my current attempt:

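import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const assistant = await openai.beta.assistants.create({
  name: "PDF Analyzer",
  model: "gpt-4o",
  instructions: "You are an expert at analyzing PDF files and extracting specific information in JSON format.",
  tools: [{ type: "file_search" }]
});

// Create thread and process file
// pdfContent must be an uploadable value, e.g. fs.createReadStream(path)
const thread = await openai.beta.threads.create();
const file = await openai.files.create({
  file: pdfContent,
  purpose: 'assistants'
});

// Add message with JSON requirements
// Assistants API v2 (the version SDK v4.x targets) attaches files through
// `attachments`; the v1-style `file_ids` field is no longer accepted here.
await openai.beta.threads.messages.create(thread.id, {
  role: "user",
  content: `Analyze the PDF and return a JSON with this structure:
  {
    "user": string,
    "issue": string,
    "date": string  // YYYY-MM-DD
  }
  Ensure all fields are present and properly formatted.`,
  attachments: [{ file_id: file.id, tools: [{ type: "file_search" }] }]
});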
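
On the error-handling side, one piece that can be sketched independently of the API calls is validating whatever text comes back from the run. The following is a minimal sketch, not a definitive implementation: `Ticket`, `parseTicket`, and `isValidIsoDate` are hypothetical names, and it assumes the reply contains a single JSON object, possibly surrounded by extra prose.

```typescript
// Shape requested in the prompt above.
interface Ticket {
  user: string;
  issue: string;
  date: string; // YYYY-MM-DD
}

// Hypothetical helper: true only for real calendar dates written as YYYY-MM-DD.
function isValidIsoDate(s: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(s)) return false;
  const d = new Date(`${s}T00:00:00Z`);
  // Reject unparseable dates and dates that silently roll over (e.g. Feb 30).
  return !Number.isNaN(d.getTime()) && d.toISOString().slice(0, 10) === s;
}

// Hypothetical helper: extracts the outermost {...} span from the reply,
// parses it, and checks that all three fields are present and well formed.
function parseTicket(reply: string): Ticket {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model reply");
  }
  const data: unknown = JSON.parse(reply.slice(start, end + 1));
  if (typeof data !== "object" || data === null) {
    throw new Error("Reply is not a JSON object");
  }
  const { user, issue, date } = data as Record<string, unknown>;
  if (typeof user !== "string" || typeof issue !== "string" || typeof date !== "string") {
    throw new Error("Missing or non-string field in reply");
  }
  if (!isValidIsoDate(date)) {
    throw new Error(`Invalid date: ${date}`);
  }
  return { user, issue, date };
}
```

Wrapping the call to `parseTicket` in a try/catch gives one place to decide whether to retry the run or flag the document for manual review.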

Macha · January 28, 2025, 4:50am

Hey there and welcome to the community!

Doing this is certainly possible and is definitely done, but it depends on what you want to do, and more specifically, how you want to analyze the document.

A couple of things to note:
First, scanned PDFs and typed PDFs can be two fundamentally different kinds of documents. Typed PDFs, typically ones exported directly from something like Microsoft Word, carry a text layer and are easy to analyze and extract information from. Scanned PDFs are essentially page images, so iirc they would likely need a visual analysis with the omni models, no matter the clarity of the scan. I would be prepared to handle the two kinds separately.
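
To make the "handle them separately" idea concrete, here is a rough sketch of how the routing could look. It is only an illustration under a few assumptions: you have already extracted per-page text with a PDF library of your choice, scanned pages have been rendered to PNG by some other tool, and the image parts are sent through the Chat Completions API. `looksScanned`, `buildVisionContent`, and the 20-character threshold are hypothetical choices, not anything official.

```typescript
// Hypothetical heuristic: if most pages yield almost no extractable text,
// the PDF is probably a scan without a text layer.
function looksScanned(pageTexts: string[], minCharsPerPage = 20): boolean {
  if (pageTexts.length === 0) return true; // nothing extractable at all
  const sparsePages = pageTexts.filter(
    (text) => text.replace(/\s/g, "").length < minCharsPerPage
  ).length;
  return sparsePages / pageTexts.length > 0.5;
}

// Content parts in the shape Chat Completions accepts for vision-capable models.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

// Builds the user-message content: the prompt followed by one image per page.
// Each entry of pngPagesBase64 is assumed to be a base64-encoded PNG.
function buildVisionContent(prompt: string, pngPagesBase64: string[]): ContentPart[] {
  return [
    { type: "text", text: prompt },
    ...pngPagesBase64.map((b64): ContentPart => ({
      type: "image_url",
      image_url: { url: `data:image/png;base64,${b64}` },
    })),
  ];
}
```

With something like this, typed PDFs can keep going through file_search, while anything `looksScanned` flags gets sent as page images instead.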

Secondly, the kind of analysis you’d like to do would also affect things. Are you trying to get summarized information? Are you trying to spit out verbatim passages?

Finally, I would look into RAG storage, as vector stores are how a majority of folks here store PDF data.
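
As a rough idea of what the vector store route looks like, here is a small sketch. Treat it as an assumption-laden outline: the `VectorStoreClient` interface is my own minimal description of the calls I believe live under `openai.beta.vectorStores` in SDK v4.x (newer SDK versions may expose them elsewhere), and `indexPdf` is a hypothetical helper name.

```typescript
// Minimal structural view of the client calls this sketch relies on.
// Assumption: mirrors openai.beta.vectorStores in SDK v4.x; verify against
// the version you have installed.
interface VectorStoreClient {
  beta: {
    vectorStores: {
      create(body: { name: string }): Promise<{ id: string }>;
      files: {
        createAndPoll(
          vectorStoreId: string,
          body: { file_id: string }
        ): Promise<{ status: string }>;
      };
    };
  };
}

// Hypothetical helper: creates a vector store, adds an already uploaded file,
// waits for indexing, and returns a value usable as `tool_resources`.
async function indexPdf(client: VectorStoreClient, fileId: string, name: string) {
  const store = await client.beta.vectorStores.create({ name });
  const indexed = await client.beta.vectorStores.files.createAndPoll(store.id, {
    file_id: fileId,
  });
  if (indexed.status !== "completed") {
    throw new Error(`Indexing ended with status "${indexed.status}"`);
  }
  // Shape accepted as `tool_resources` when creating an assistant or a thread.
  return { file_search: { vector_store_ids: [store.id] } };
}
```

The returned object would then be passed as `tool_resources` when creating the assistant or the thread, so that file_search looks in that store.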
