I am working on a Python project using the OpenAI API that processes emails daily and interacts with them. Currently, I download emails as PDFs and interact with these PDFs (e.g., extracting text, creating a vector store, etc.). It works, but I am not satisfied with the answers, especially compared to uploading the same PDFs to the ChatGPT website, where the responses are much better.
Am I approaching this the right way? What’s the best method to read PDFs? I’ve noticed some approaches convert PDFs to images, while others use an assistant directly. In my case, the PDFs can include graphs, and I feel that assistants don’t fully understand the graphs.
Step 1: Extract text from PDF

    import PyPDF2

    def extract_text_from_pdf(pdf_path):
        text = ""
        with open(pdf_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                # extract_text() can return None on image-only pages
                text += page.extract_text() or ""
        return text
If you know that your documents have substantial graphical content (charts and tables) that is not expressed in plain text, then I would suggest treating each page as an image. The way I’ve done this in the past is to convert each page to an image and send it with a system prompt that describes how to output in Markdown format. You can also describe how to treat tables and charts: either converting them to native Markdown as well, or converting them to another format (YAML works very well).
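Here’s a minimal sketch of that approach, assuming PyMuPDF for rendering and the current OpenAI Python SDK; the model name, DPI, and prompt wording are just illustrative:

    import base64
    import fitz  # PyMuPDF
    from openai import OpenAI

    client = OpenAI()

    SYSTEM_PROMPT = (
        "You transcribe document pages into Markdown. "
        "Render tables as Markdown tables and describe charts as YAML blocks."
    )

    def page_to_markdown(pdf_path, page_number):
        doc = fitz.open(pdf_path)
        pix = doc[page_number].get_pixmap(dpi=150)  # render the page as a PNG
        b64 = base64.b64encode(pix.tobytes("png")).decode()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ]},
            ],
        )
        return response.choices[0].message.content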
The rest of the process looks good to me (chunking and vectorizing text and creating conversational chains). For graphs and tables, I would try to keep them “together”, i.e. so you don’t split a single graph into separate chunks.
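If the pages are already in Markdown (as above), one simple way to keep tables and chart blocks whole is to chunk on blank lines and never cut inside a block; a rough sketch, with max_chars as an arbitrary limit:

    def chunk_markdown(md, max_chars=2000):
        # Blank lines separate blocks, so a table or a YAML chart block
        # stays in one piece; an oversized block becomes its own chunk
        # rather than being split.
        blocks = md.split("\n\n")
        chunks, current = [], ""
        for block in blocks:
            if current and len(current) + len(block) > max_chars:
                chunks.append(current)
                current = ""
            current = (current + "\n\n" + block).strip()
        return chunks + [current] if current else chunks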
You might as well combine different strategies… like contextual analysis on each word, where you put that into subgraphs of an information graph and find relations.
Or spatial grouping (e.g. find a recipient address on an invoice, add a bounding box around it, and when another invoice in the same format comes in, check whether all the bounding boxes in that rectangle have made it into the structured output; see the sketch below)…
Or use a CNN to analyse floor plan drawings, or statistics inside the PDF, to get relations…
There is a lot more than just NLP in semantic extraction…
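A rough sketch of the spatial-grouping idea using PyMuPDF; the rectangle coordinates are hypothetical and would come from whatever template you derived from the first invoice:

    import fitz  # PyMuPDF

    # Hypothetical region where the recipient address sits in this
    # invoice template (x0, y0, x1, y1 in PDF points).
    RECIPIENT_BOX = fitz.Rect(50, 120, 300, 200)

    def extract_recipient(pdf_path):
        page = fitz.open(pdf_path)[0]
        # clip= restricts text extraction to the bounding box
        return page.get_text("text", clip=RECIPIENT_BOX).strip()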
There are guys who put stuff like this in their CV to show their skill level.
So you have to use an “ant algorithm” to find it haha…
OK, if it’s managed solutions, then I would also recommend Unstructured. But honestly, I’ve done multi-modal PDF parsing “manually” (e.g. using PyMuPDF and GPT-4o) and it’s really not that hard.
Just registered at LlamaParse… uploaded a document, and it has already taken more than 3 minutes on a single PDF with 117 pages. Still “Loading”…
Is that a joke?
I’m just saying: my own OCR processor will find 97 out of 100 cases that I am searching for in a PDF, where the assistant finds 51 at most, and every time the assistant finds a different amount, sometimes only 10, sometimes 25.
And damn, I fought so hard to get it down from 12 seconds to less than 6… And I am still unhappy with that speed.
Wow, I was writing all this and it still says “Loading”. Is that some kind of split-the-pages-and-loop-over-them-in-one-thread thing?
It really depends on the data, doesn’t it?
A lot of PDFs are easy. But then a lot of them really are not. Pitch decks, for example. And platforms like LlamaParse and Unstructured make it ‘irrelevant’ what the input is: you get Markdown back, with tables for charts, etc.
So yeah… right tool for the right job.
Just an outside-of-the-box question, but do you have to download the emails as PDF, or can you just download the source code, HTML, images, etc.?
Or build code to convert the PDF back to the source code and then have ChatGPT process that.
Processing PDFs can sometimes be a bit of a pita with all the converting. I’m not surprised ChatGPT is missing things. 4o would miss things; o1 and o1-mini might not.
If you can get your hands on the original code, it could make things easier for you, though.
Just a thought.
I wanted to convert the document to PDF because I was considering the possibility of other teams using the same tool by placing the file in a shared folder, making it accessible for interaction.
There are multiple PDF standards, and it is not guaranteed that the text is in a text layer.
So PDF-to-text transformation is a gamble if unsupervised.
If you don’t want to work with spatial grouping, you can just as well strip the HTML; even removing stop words like “the” or “and” might have a positive impact, depending on the type of document (see the sketch below).
Having multiple specialized data pipelines based on document type is recommended.
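A minimal sketch of that cleanup step, assuming BeautifulSoup for the HTML stripping; the stop-word list is a tiny illustrative sample, not a real lexicon:

    from bs4 import BeautifulSoup

    STOP_WORDS = {"the", "and", "a", "an", "of", "to"}  # illustrative subset

    def clean_email_html(html):
        # Strip tags, then drop stop words before chunking/vectorizing
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)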
Just create an email viewer app in Flask with Dash that you can share with them, regardless of the file type. Could be easier. I just did one for a data science project I was working on, and it only took about an hour.
It’s surprisingly easy. It will have a GUI interface, and they can review whatever they want on the front end in whatever format you choose. Also looks super professional. Only possible to build with o1-mini and o1 at this point, though.
Just tell it what you want it to do and how you want it to look, then go back and forth with it until you get your final project going. Also pretty fun if you want to geek out with the AI possibilities. Cheers!
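A bare-bones sketch of such a viewer, assuming Plotly Dash (which runs on Flask); load_email_bodies is a hypothetical stub for whatever reads the shared folder:

    from dash import Dash, Input, Output, dcc, html

    def load_email_bodies():
        # Hypothetical stub: replace with code that reads your shared folder
        return ["Example email body 1", "Example email body 2"]

    app = Dash(__name__)
    app.layout = html.Div([
        html.H2("Email Viewer"),
        dcc.Dropdown(
            id="email-picker",
            options=[{"label": f"Email {i + 1}", "value": i}
                     for i in range(len(load_email_bodies()))],
            value=0,
        ),
        html.Pre(id="email-body"),
    ])

    @app.callback(Output("email-body", "children"), Input("email-picker", "value"))
    def show_email(index):
        return load_email_bodies()[index]

    if __name__ == "__main__":
        app.run(debug=True)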
LOL! Probably what everyone started doing: copying and pasting text. Yes, Adobe created it back in the day to stop people from copying sensitive documents. It’s been a pain ever since if you’ve ever wanted to copy or parse anything, that’s for damn sure.
Converting to PDF sounds like a bad idea from the start - for one thing, you’ve lost all the headers (Message-ID, In-Reply-To, References, Cc: … maybe even dates and sender/recipients?), you’ve lost all the original formats of all the MIME parts, and you probably aren’t going to get any attachments at all.
I’m processing emails using “tika” - it’s “ok”, but a lot of manual extra work is needed, depending on what you need it for.
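For comparison, parsing the raw message directly keeps everything the PDF conversion throws away; a minimal sketch using only the Python standard library, with "message.eml" as a placeholder path:

    from email import policy
    from email.parser import BytesParser

    with open("message.eml", "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    # Headers that a PDF conversion loses
    for header in ("Message-ID", "In-Reply-To", "References", "Cc", "Date", "From", "To"):
        print(header, "=>", msg.get(header))

    # The body and the attachments arrive as separate MIME parts
    body = msg.get_body(preferencelist=("plain", "html"))
    print(body.get_content() if body else "(no body)")
    for part in msg.iter_attachments():
        print("attachment:", part.get_filename(), part.get_content_type())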
The issue is that I have large PDFs, and I want it to give me, for example, the most trending companies (the most mentioned ones). But it can’t count words properly—it gives a different result every time.
Do you think it’s better to use another technique or another format?
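For the counting part specifically, that’s the kind of task worth taking out of the model entirely: extract the text once, then tally mentions deterministically. A rough sketch, with the company list as an illustrative placeholder:

    import re
    from collections import Counter

    COMPANIES = ["Acme Corp", "Globex", "Initech"]  # illustrative placeholders

    def count_mentions(text):
        counts = Counter()
        for name in COMPANIES:
            # Case-insensitive whole-phrase match
            counts[name] = len(re.findall(rf"\b{re.escape(name)}\b", text, re.IGNORECASE))
        return counts

    # e.g. feed it the output of extract_text_from_pdf from above:
    # print(count_mentions(extract_text_from_pdf("report.pdf")).most_common(5))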