I am working on a Python project using the OpenAI API that processes emails daily and interacts with them. Currently, I download emails as PDFs and interact with these PDFs (e.g., extracting text, creating a vector store, etc.). It works, but I am not satisfied with the answers, especially compared to uploading the same PDFs to the ChatGPT website, where the responses are much better.
Am I approaching this the right way? What’s the best method to read PDFs? I’ve noticed some approaches convert PDFs to images, while others use an assistant directly. In my case, the PDFs can include graphs, and I feel that assistants don’t fully understand the graphs.
Step 1: Extract text from PDF

    import PyPDF2

    def extract_text_from_pdf(pdf_path):
        text = ""
        with open(pdf_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                # extract_text() can return None on image-only pages
                text += page.extract_text() or ""
        return text
If you know that your documents have substantial graphical content (charts and tables) that is not expressed in plain text, then I would suggest treating each page as an image. The way I’ve done this in the past is to convert each page to an image and send it with a system prompt that describes how to output in Markdown format. You can also describe how to treat tables and charts: either converting them to native Markdown as well, or converting them to another format (YAML works very well).
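Here’s a minimal sketch of that approach, assuming PyMuPDF for rendering and the current OpenAI Python SDK; the model name, DPI, and prompt wording are just illustrative:

    import base64
    import fitz  # PyMuPDF
    from openai import OpenAI

    client = OpenAI()

    SYSTEM_PROMPT = (
        "You transcribe document pages into Markdown. "
        "Render tables as Markdown tables and describe charts as YAML blocks."
    )

    def page_to_markdown(pdf_path, page_number):
        doc = fitz.open(pdf_path)
        pix = doc[page_number].get_pixmap(dpi=150)  # render the page as a PNG
        b64 = base64.b64encode(pix.tobytes("png")).decode()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ]},
            ],
        )
        return response.choices[0].message.content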
The rest of the process looks good to me (chunking and vectorizing text and creating conversational chains). For graphs and tables, I would try to keep them “together”, i.e. so you don’t split a single graph into separate chunks.
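If the pages are already in Markdown (as above), one simple way to keep tables and chart blocks whole is to chunk on blank lines and never cut inside a block; a rough sketch, with max_chars as an arbitrary limit:

    def chunk_markdown(md, max_chars=2000):
        # Blank lines separate blocks, so a table or a YAML chart block
        # stays in one piece; an oversized block becomes its own chunk
        # rather than being split.
        blocks = md.split("\n\n")
        chunks, current = [], ""
        for block in blocks:
            if current and len(current) + len(block) > max_chars:
                chunks.append(current)
                current = ""
            current = (current + "\n\n" + block).strip()
        return chunks + [current] if current else chunks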
You might as well combine different strategies… like contextual analysis on each word, where you put that into subgraphs of an information graph and find relations.
Or spatial grouping (e.g. find a recipient address on an invoice, add a bounding box around it, and when another invoice in the same format comes in, check whether all the bounding boxes in that rectangle have made it into the structured output; see the sketch below)…
Or use a CNN to analyse floor plan drawings, or statistics inside the PDF, to get relations…
There is a lot more than just NLP in semantic extraction…
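A rough sketch of the spatial-grouping idea using PyMuPDF; the rectangle coordinates are hypothetical and would come from whatever template you derived from the first invoice:

    import fitz  # PyMuPDF

    # Hypothetical region where the recipient address sits in this
    # invoice template (x0, y0, x1, y1 in PDF points).
    RECIPIENT_BOX = fitz.Rect(50, 120, 300, 200)

    def extract_recipient(pdf_path):
        page = fitz.open(pdf_path)[0]
        # clip= restricts text extraction to the bounding box
        return page.get_text("text", clip=RECIPIENT_BOX).strip()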
There are guys who put stuff like this in their CV to show their skill level.
So you have to use an “ant algorithm” to find it haha…
OK, if it’s managed solutions, then I would also recommend Unstructured. But honestly, I’ve done multi-modal PDF parsing “manually” (e.g. using PyMuPDF and GPT-4o) and it’s really not that hard.
Just registered at LlamaParse… uploaded a document, and it has already taken more than 3 minutes on a single PDF with 117 pages. Still “Loading”…
Is that a joke?
I’m just saying: my own OCR processor will find 97 out of 100 cases that I am searching for in a PDF, where the assistant finds 51 at most, and every time the assistant finds a different amount, sometimes only 10, sometimes 25.
And damn, I fought so hard to get it down from 12 seconds to less than 6… And I am still unhappy with that speed.
Wow, I was writing all this and it still says “Loading”. Is that some kind of split-the-pages-and-loop-over-them-in-one-thread thing?
It really depends on the data, doesn’t it?
A lot of PDFs are easy. But then a lot of them really are not. Pitch decks, for example. And platforms like LlamaParse and Unstructured make it ‘irrelevant’ what the input is: you get Markdown back, with tables for charts, etc.
So yeah… right tool for the right job.
Just an outside-of-the-box question, but do you have to download the emails as PDF, or can you just download the source code, HTML, images, etc.?
Or build code to convert the PDF back to the source code and then have ChatGPT process that.
Processing PDFs can sometimes be a bit of a pita with all the converting. I’m not surprised ChatGPT is missing things. 4o would miss things; o1 and o1-mini might not.
If you can get your hands on the original code, it could make things easier for you, though.
Just a thought.
I wanted to convert the document to PDF because I was considering the possibility of other teams using the same tool by placing the file in a shared folder, making it accessible for interaction.
There are multiple PDF standards, and it is not guaranteed that the text is in a text layer.
So PDF-to-text transformation is a gamble if unsupervised.
If you don’t want to work with spatial grouping, you can just as well strip the HTML; even removing stop words like “the” or “and” might have a positive impact, depending on the type of document (see the sketch below).
Having multiple specialized data pipelines based on document type is recommended.
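A minimal sketch of that cleanup step, assuming BeautifulSoup for the HTML stripping; the stop-word list is a tiny illustrative sample, not a real lexicon:

    from bs4 import BeautifulSoup

    STOP_WORDS = {"the", "and", "a", "an", "of", "to"}  # illustrative subset

    def clean_email_html(html):
        # Strip tags, then drop stop words before chunking/vectorizing
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)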
Just create an email viewer app in Flask with Dash that you can share with them, regardless of the file type. Could be easier. I just did one for a data science project I was working on, and it only took about an hour.
It’s surprisingly easy. It will have a GUI interface, and they can review whatever they want on the front end in whatever format you choose. Also looks super professional. Only possible to build with o1-mini and o1 at this point, though.
Just tell it what you want it to do and how you want it to look, then go back and forth with it until you get your final project going. Also pretty fun if you want to geek out with the AI possibilities. Cheers!
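A bare-bones sketch of such a viewer, assuming Plotly Dash (which runs on Flask); load_email_bodies is a hypothetical stub for whatever reads the shared folder:

    from dash import Dash, Input, Output, dcc, html

    def load_email_bodies():
        # Hypothetical stub: replace with code that reads your shared folder
        return ["Example email body 1", "Example email body 2"]

    app = Dash(__name__)
    app.layout = html.Div([
        html.H2("Email Viewer"),
        dcc.Dropdown(
            id="email-picker",
            options=[{"label": f"Email {i + 1}", "value": i}
                     for i in range(len(load_email_bodies()))],
            value=0,
        ),
        html.Pre(id="email-body"),
    ])

    @app.callback(Output("email-body", "children"), Input("email-picker", "value"))
    def show_email(index):
        return load_email_bodies()[index]

    if __name__ == "__main__":
        app.run(debug=True)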
LOL! Probably what everyone started doing: copying and pasting text. Yes, Adobe created it back in the day to stop people from copying sensitive documents. It’s been a pain ever since if you’ve ever wanted to copy or parse anything, that’s for damn sure.
Converting to PDF sounds like a bad idea from the start - for one thing, you’ve lost all the headers (Message-ID, In-Reply-To, References, Cc: … maybe even dates and sender/recipients?), you’ve lost all the original formats of all the MIME parts, and you probably aren’t going to get any attachments at all.
I’m processing emails using “tika” - it’s “ok”, but a lot of manual extra work is needed, depending on what you need it for.
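For comparison, parsing the raw message directly keeps everything the PDF conversion throws away; a minimal sketch using only the Python standard library, with "message.eml" as a placeholder path:

    from email import policy
    from email.parser import BytesParser

    with open("message.eml", "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    # Headers that a PDF conversion loses
    for header in ("Message-ID", "In-Reply-To", "References", "Cc", "Date", "From", "To"):
        print(header, "=>", msg.get(header))

    # The body and the attachments arrive as separate MIME parts
    body = msg.get_body(preferencelist=("plain", "html"))
    print(body.get_content() if body else "(no body)")
    for part in msg.iter_attachments():
        print("attachment:", part.get_filename(), part.get_content_type())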
The issue is that I have large PDFs, and I want it to give me, for example, the most trending companies (the most mentioned ones). But it can’t count words properly—it gives a different result every time.
Do you think it’s better to use another technique or another format?
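For the counting part specifically, that’s the kind of task worth taking out of the model entirely: extract the text once, then tally mentions deterministically. A rough sketch, with the company list as an illustrative placeholder:

    import re
    from collections import Counter

    COMPANIES = ["Acme Corp", "Globex", "Initech"]  # illustrative placeholders

    def count_mentions(text):
        counts = Counter()
        for name in COMPANIES:
            # Case-insensitive whole-phrase match
            counts[name] = len(re.findall(rf"\b{re.escape(name)}\b", text, re.IGNORECASE))
        return counts

    # e.g. feed it the output of extract_text_from_pdf from above:
    # print(count_mentions(extract_text_from_pdf("report.pdf")).most_common(5))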