I am currently trying to create an app that has to send a PDF file (sometimes scanned, sometimes not) to the OpenAI API, then ask something like:
“Analyze this PDF file and return a JSON with the following structure:
{
  "user": string,
  "issue": string,
  "date": string // format YYYY-MM-DD
}”
My current questions are:
What’s the best approach to implement this with the Assistants API?
Should I use file_search or any other tool?
Any specific setup for PDF handling?
Technical implementation:
Using TypeScript with OpenAI SDK v4.80.0
Need to handle the file upload correctly
Want to ensure proper error handling
Has anyone successfully implemented something similar? Any best practices or pitfalls to avoid?
Code snippet of my current attempt:
const assistant = await openai.beta.assistants.create({
  name: "PDF Analyzer",
  model: "gpt-4o",
  instructions: "You are an expert at analyzing PDF files and extracting specific information in JSON format.",
  tools: [{ type: "file_search" }]
});

// Create thread and process file
const thread = await openai.beta.threads.create();
const file = await openai.files.create({
  file: pdfContent,
  purpose: "assistants"
});

// Add message with JSON requirements. In Assistants API v2 (SDK v4.x),
// files are attached per message via `attachments`; the old `file_ids`
// parameter no longer exists.
await openai.beta.threads.messages.create(thread.id, {
  role: "user",
  content: `Analyze the PDF and return a JSON with this structure:
{
  "user": string,
  "issue": string,
  "date": string // YYYY-MM-DD
}
Ensure all fields are present and properly formatted.`,
  attachments: [{ file_id: file.id, tools: [{ type: "file_search" }] }]
});
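The snippet stops before running the assistant. After something like `openai.beta.threads.runs.createAndPoll(thread.id, { assistant_id: assistant.id })` completes and you read back the last message, you still have to pull the JSON out of a free-text reply. A minimal sketch of that parsing step (the field names mirror the structure above; the brace-scanning approach is an assumption about how the model wraps its answer in prose or code fences):

```typescript
interface ExtractedFields {
  user: string;
  issue: string;
  date: string; // expected YYYY-MM-DD
}

// The model often wraps the JSON in prose or a markdown code fence, so
// take the outermost brace-delimited span rather than parsing the whole
// reply, then validate that the expected fields are present.
function parseAssistantJson(raw: string): ExtractedFields {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in assistant reply");
  }
  const parsed = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;
  for (const key of ["user", "issue", "date"] as const) {
    if (typeof parsed[key] !== "string") {
      throw new Error(`Missing or non-string field: ${key}`);
    }
  }
  return parsed as unknown as ExtractedFields;
}
```

Throwing on missing fields gives you one place to catch both run failures and malformed replies.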
Doing this is certainly possible and is definitely done, but it depends on what you want to do, and more specifically, how you want to analyze the document.
A couple of things to note: scanned PDFs and typed PDFs can be two fundamentally different kinds of documents. Typed PDFs, typically ones generated directly from something like Microsoft Word, are easy to analyze and extract information from. Scanned PDFs, IIRC, will likely need visual analysis with the omni models, no matter how clear the scan is. I would be prepared to handle them separately.
Secondly, the kind of analysis you’d like to do would also affect things. Are you trying to get summarized information? Are you trying to spit out verbatim passages?
Finally, I would look into RAG storage, as vector stores are how a majority of folks here store PDF data.
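On the first point, the typed-vs-scanned split can usually be decided cheaply from whether the PDF has a usable text layer. A sketch of just the routing decision (which extraction library produces `extractedText` is left open, and the 50-characters-per-page threshold is an arbitrary assumption to tune):

```typescript
type PdfRoute = "text-extraction" | "vision-ocr";

// Scanned PDFs typically yield little or no text layer, so route them to
// page images plus a vision-capable model; typed PDFs can go through
// ordinary text extraction.
function choosePdfRoute(extractedText: string, pageCount: number): PdfRoute {
  const charsPerPage = extractedText.trim().length / Math.max(pageCount, 1);
  return charsPerPage >= 50 ? "text-extraction" : "vision-ocr";
}
```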
I am trying to read scanned and/or typed PDFs and extract some information (like a date, a document number, etc.) one by one, because each document is independent. I don’t need to store them, because none of them are linked, but they all have the same purpose (ordering something).
I guess converting them to images, then using GPT-4o mini to read them, is the best option? Or do you have something better?
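For reference, the image route through Chat Completions could look like the sketch below, which only assembles the request body (the model name, prompt wording, and helper are assumptions; rendering PDF pages to PNG base64 strings is left to a separate tool):

```typescript
// Build a Chat Completions request that sends each rendered page as a
// data-URL image alongside the extraction instructions.
function buildVisionRequest(pageImagesBase64: string[]) {
  return {
    model: "gpt-4o-mini",
    response_format: { type: "json_object" as const },
    messages: [
      {
        role: "user" as const,
        content: [
          {
            type: "text" as const,
            text:
              'Read this scanned document and return JSON with this structure: ' +
              '{"user": string, "issue": string, "date": "YYYY-MM-DD"}',
          },
          ...pageImagesBase64.map((b64) => ({
            type: "image_url" as const,
            image_url: { url: `data:image/png;base64,${b64}` },
          })),
        ],
      },
    ],
  };
}

// Usage (untested sketch):
// const completion = await openai.chat.completions.create(buildVisionRequest(pages));
```

Note that `response_format: { type: "json_object" }` requires the word "JSON" to appear in the prompt, which the text part above satisfies.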
You can use PDF tools with built-in OCR to convert the file to searchable text.
Many scanners have software packages that do this also.
If you can select and copy text out of the PDF file currently, and it is not password-locked, it should be suitable for the programmatic text extraction done in preparing documents for vector stores.
I tried a new way to do it… Using the Assistants API, we can ask the assistant to output JSON.
But when I try it in the playground, it does not work… It creates a correct JSON but with mock values… What am I doing wrong?
In Assistants you have a choice to make - only one of:
structured outputs as an API parameter with a schema, OR
file search using a vector store.
You have no file search enabled in the screenshot. You asked the AI about some PDF file that you didn’t supply. You gave it a mandatory JSON that it must fill out, with no option such as an alternate anyOf subschema for “no information present from PDF”.
Therefore, the AI must fill out the specified JSON format, and cannot deviate. You are the one forcing it to fabricate.
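One way to give the model that escape hatch is a `response_format` schema where the payload is an anyOf between the extraction and an explicit error report. A sketch (field names copied from the thread; the anyOf is nested under a `result` property because, last I checked, strict structured outputs do not allow anyOf at the schema root):

```typescript
// A structured-outputs schema with an error branch, so the model can
// report "nothing to extract" instead of fabricating values.
const extractionSchema = {
  type: "json_schema" as const,
  json_schema: {
    name: "pdf_extraction",
    strict: true,
    schema: {
      type: "object",
      properties: {
        result: {
          anyOf: [
            {
              type: "object",
              properties: {
                user: { type: "string" },
                issue: { type: "string" },
                date: { type: "string", description: "YYYY-MM-DD" },
              },
              required: ["user", "issue", "date"],
              additionalProperties: false,
            },
            {
              type: "object",
              properties: {
                error: {
                  type: "string",
                  description: "Why the fields could not be extracted",
                },
              },
              required: ["error"],
              additionalProperties: false,
            },
          ],
        },
      },
      required: ["result"],
      additionalProperties: false,
    },
  },
};
```

Your application code then branches on whether the returned `result` carries the fields or an `error`.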
You’ll need to supply your PDF via a vector store for file search. This means that you can only tell the AI in its instructions what kind of automated JSON it must produce for you. Then figure out a different kind of response so the AI can report an error condition when it did not receive adequate direct information.
Also, you cannot refer to a specific file name like you show. Search will search across all documents provided, and provide chunks from documents of highest relevance to the search terms the AI model writes. This means you can use it for additional knowledge, but not to answer about an entire PDF as one unit.
Hopefully that knowledge will let you adapt both the application of files and the generation of responses to be truthful and satisfying.
Also, in this case, would it be a good approach to add a file to the vector store, process it, and then delete it? This way, there would always be only one PDF at a time.
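Concretely, that add-process-delete lifecycle could look something like this sketch (the `createAndPoll` and `del` method names follow the v4 SDK's `beta.vectorStores` surface, so double-check them against your SDK version; the client is passed in to keep the flow easy to test):

```typescript
// Upload a PDF, attach it to a fresh vector store, run the caller's
// processing step, then always clean up both resources, so only one PDF
// ever exists in a store at a time.
async function withTemporaryPdf<T>(
  openai: any, // OpenAI client, typed loosely here for brevity
  pdf: unknown,
  process: (vectorStoreId: string) => Promise<T>
): Promise<T> {
  const file = await openai.files.create({ file: pdf, purpose: "assistants" });
  const store = await openai.beta.vectorStores.create({ name: "single-pdf" });
  try {
    // Wait until the file is chunked and indexed before querying it
    await openai.beta.vectorStores.files.createAndPoll(store.id, {
      file_id: file.id,
    });
    return await process(store.id);
  } finally {
    await openai.beta.vectorStores.del(store.id);
    await openai.files.del(file.id);
  }
}
```

The `try`/`finally` matters: if the run fails, the store and file are still deleted, so a crash does not leave stale PDFs influencing the next document's search results.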
If none of this works properly, would the best approach be to go back and use the chat, with the PDF converted to an image + the extracted text included directly in the prompt?
Edit: I tried to add the PDF to a vector store from the file search button and asked it to create the JSON, but unfortunately it did not work…