I’m trying to parse a 5–6-page PDF travel ticket using the OpenAI GPT-4o-mini model. My approach is to convert each page of the PDF into an image, stack those page images into a single combined image, and send that combined image along with a prompt to the model for parsing.
However, when the PDF has many pages, the combined image becomes too tall and exceeds the maximum size supported by the API, so the request fails with an “invalid image” error.
Has anyone faced a similar issue, or does anyone have suggestions on how to handle multi-page PDFs without exceeding the image size limit, and parse data accurately from a multi-page PDF while maintaining context?
It’s tough. It’s a massive shame that OpenAI doesn’t accept PDFs directly (competitors do). So the first step is to extract the text content from the PDF and pass it to the model alongside the page images, whose text semantics are destroyed by rasterization.
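As a sketch of that first step: the chat-completions vision format lets one user message carry several image parts, so you can send each page as its own image (plus its extracted text) instead of stitching pages into one oversized image. The helper below only builds the message; how you obtain `page_texts` (e.g. pypdf’s `extract_text()`) and `page_pngs` is up to you, and the “Page N” labelling is just an illustrative convention.

```python
import base64

def build_vision_message(prompt: str, page_texts: list[str], page_pngs: list[bytes]) -> dict:
    # One user message: the prompt, then per page the extracted text and the
    # rendered image. Separate image parts avoid the single-tall-image size limit.
    parts = [{"type": "text", "text": prompt}]
    for i, (text, png) in enumerate(zip(page_texts, page_pngs), start=1):
        parts.append({"type": "text", "text": f"Page {i} extracted text:\n{text}"})
        b64 = base64.b64encode(png).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"role": "user", "content": parts}
```

The returned dict can go straight into the `messages=[...]` list of a chat-completions call with `model="gpt-4o-mini"`.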
Then you need to determine whether the model can work on each page in parallel, or whether a model is even needed (OCR may work here; I’d recommend checking out OCR2.0).
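A minimal sketch of the parallel per-page step, assuming a `parse_page` callable (your model or OCR call, hypothetical here) that turns one page into a result:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_pages(pages, parse_page, max_workers=4):
    # Fan out over pages concurrently; executor.map preserves page order,
    # which matters for the later synthesis stage.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(parse_page, pages))
```

Threads are a reasonable default since the per-page work is I/O-bound API calls; `max_workers` should respect your rate limits.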
With the model working on each page in parallel, you can then perform a final synthesis stage where the model combines all the gathered information into whatever output you expect (I imagine you’re expecting something structured).
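The synthesis stage can be one final call that sees every page’s result at once. A sketch, assuming the per-page results come back as dicts; the wording of the instruction is illustrative, not a fixed recipe:

```python
import json

def build_synthesis_prompt(page_results: list[dict]) -> str:
    # Concatenate per-page findings into one prompt for a final model call
    # that merges them into a single structured record.
    sections = [
        f"Page {i} findings:\n{json.dumps(result, indent=2)}"
        for i, result in enumerate(page_results, start=1)
    ]
    return (
        "Combine the per-page findings below into one structured ticket record. "
        "Resolve duplicates and keep page context.\n\n" + "\n\n".join(sections)
    )
```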
With an appropriate “approval” system that categorizes and sorts these processed documents, you can go deeper and perform classification steps on the pages to determine whether they are even useful. In my case a lot of documents come with terms & conditions and similar nasty, noisy content, so I just eliminate those pages, which leads to cheaper costs and more accurate results.
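That page-filtering step can start as a cheap keyword heuristic before any model call; the marker list below is purely illustrative and should be tuned to your own documents:

```python
# Pages containing these phrases are treated as boilerplate (example markers only).
NOISE_MARKERS = ("terms & conditions", "terms and conditions", "fare rules")

def is_useful_page(page_text: str) -> bool:
    # Drop boilerplate pages (T&Cs etc.) so they never reach the model,
    # cutting cost and reducing noise in the final synthesis.
    lowered = page_text.lower()
    return not any(marker in lowered for marker in NOISE_MARKERS)
```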
I was looking for something similar, but following your suggestions and setting it up myself was way too much work. I just found the tool Parble and it is exactly what I needed. Might help you too!