Best practices for PDF parsing with Assistants API and file_search tool

Hi everyone,

I am currently building an app that needs to send a PDF file (sometimes scanned, sometimes not) to the OpenAI API and then ask something like:
"Analyze this PDF file and return a JSON with the following structure:
{
"user": string,
"issue": string,
"date": string // format YYYY-MM-DD
}"

My current questions are:

  1. What’s the best approach to implement this with the Assistants API?

    • Should I use file_search or any other tool?
    • Any specific setup for PDF handling?
  2. Technical implementation:

    • Using TypeScript with OpenAI SDK v4.80.0
    • Need to handle the file upload correctly
    • Want to ensure proper error handling

Has anyone successfully implemented something similar? Any best practices or pitfalls to avoid?

Code snippet of my current attempt:

const assistant = await openai.beta.assistants.create({
  name: "PDF Analyzer",
  model: "gpt-4o",
  instructions: "You are an expert at analyzing PDF files and extracting specific information in JSON format.",
  tools: [{ type: "file_search" }]
});

// Create thread and process file
const thread = await openai.beta.threads.create();
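// pdfContent must be something the SDK can upload with a .pdf filename,
// e.g. fs.createReadStream(path) or await toFile(buffer, "document.pdf")
const file = await openai.files.create({
  file: pdfContent,
  purpose: 'assistants'
});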

// Add message with JSON requirements
await openai.beta.threads.messages.create(thread.id, {
  role: "user",
  content: `Analyze the PDF and return a JSON with this structure:
  {
    "user": string,
    "issue": string,
    "date": string  // YYYY-MM-DD
  }
  Ensure all fields are present and properly formatted.`,
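  // Assistants API v2 replaced message-level file_ids with attachments
  attachments: [{ file_id: file.id, tools: [{ type: "file_search" }] }]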
});
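
On the error-handling side, one thing that seems worth doing regardless of the tool is validating the reply before trusting it. Here is a minimal sketch, assuming the assistant's reply arrives as a plain string that may have prose around the JSON. `ExtractedIssue`, `extractJsonSpan`, `isIsoDate` and `parseExtractedIssue` are hypothetical names for illustration, not part of the OpenAI SDK:

```typescript
// Shape requested in the prompt above.
interface ExtractedIssue {
  user: string;
  issue: string;
  date: string; // YYYY-MM-DD
}

// Hypothetical helper: takes the span from the first "{" to the last "}",
// since replies often wrap the JSON in prose or a Markdown fence.
function extractJsonSpan(reply: string): string {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in assistant reply");
  }
  return reply.slice(start, end + 1);
}

// Checks both the YYYY-MM-DD format and that the date exists on the calendar.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

// Parses the reply and throws a descriptive error if any field is wrong.
function parseExtractedIssue(reply: string): ExtractedIssue {
  const data: unknown = JSON.parse(extractJsonSpan(reply));
  if (typeof data !== "object" || data === null) {
    throw new Error("Reply is not a JSON object");
  }
  const { user, issue, date } = data as Record<string, unknown>;
  if (typeof user !== "string" || typeof issue !== "string" || typeof date !== "string") {
    throw new Error("Missing or non-string field: expected user, issue, date");
  }
  if (!isIsoDate(date)) {
    throw new Error("Date is not a valid YYYY-MM-DD value: " + date);
  }
  return { user, issue, date };
}
```

The idea would be to call `parseExtractedIssue` inside a `try`/`catch` and retry or flag the document when it throws, rather than passing unchecked output downstream.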

Hey there and welcome to the community!

This is certainly possible and people do it regularly, but the right approach depends on what you want to do and, more specifically, how you want to analyze the document.

A couple of things to note:
Scanned PDFs and typed PDFs can be two fundamentally different kinds of documents. Typed PDFs, typically ones exported directly from something like Microsoft Word, contain real text and are easy to analyze and extract information from. Scanned PDFs, iirc, would likely need visual analysis with the omni models, no matter how clear the scan is. I would be prepared to handle the two cases separately.
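
One rough way to route the two cases is to look at how much text the PDF actually yields. This is only a sketch under assumptions: it presumes you have already pulled the text layer and page count out of the file with a PDF library of your choice (not shown), and the `classifyPdf` name and the 50-characters-per-page threshold are made up for illustration and would need tuning on real files:

```typescript
type PdfKind = "typed" | "scanned";

// Assumption: very little extractable text per page suggests the pages are
// images rather than text. The default threshold is a guess, not a standard.
function classifyPdf(
  extractedText: string,
  pageCount: number,
  minCharsPerPage: number = 50
): PdfKind {
  if (pageCount <= 0) {
    throw new Error("pageCount must be positive");
  }
  // Ignore whitespace so blank pages full of newlines do not count as text.
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars / pageCount < minCharsPerPage ? "scanned" : "typed";
}
```

Note that a scan that has already been through OCR carries a text layer, so this check would classify it as "typed", which is probably what you want since the text can then be searched directly.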

Secondly, the kind of analysis you’d like to do would also affect things. Are you trying to get summarized information? Are you trying to spit out verbatim passages?

Finally, I would look into RAG storage, as vector stores are how a majority of folks here store PDF data.
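
If vector stores are new to you, the core idea is small: each chunk of the document is stored with an embedding, and a query retrieves the chunks whose embeddings are closest to it. Here is a toy in-memory sketch of that lookup. It is not how the hosted vector store is implemented, the embeddings are assumed to come from elsewhere, and `StoredChunk`, `cosineSimilarity` and `topChunks` are illustrative names:

```typescript
interface StoredChunk {
  text: string;
  embedding: number[]; // assumed to be produced by some embedding model
}

// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k chunks most similar to the query embedding, best first.
function topChunks(query: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

The retrieved chunks are what gets handed to the model as context, which is why this approach works well for typed PDFs and poorly for scans with no text to embed.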