I have a bunch of PDFs and a chatbot. My issue is that I can only send a few pages at a time, since the PDFs can be 100+ pages. Many of these pages are almost identical, with slightly varying information but different dates, and the dates are critical. How can I make sure the RAG retrieval can navigate by date? We already do the obvious: OCR, chunking, and embeddings.
Hi @veeraj and welcome to the community!
Just to clarify: a given document may contain 100+ pages, and the content across these pages varies only slightly, but the dates change. Since the content is so similar, you want to ensure the responses are precise according to the date. Is this correct?
Here is how I would approach this:
- Upload the PDF
- OCR the whole file
- Use an LLM to fix formatting and OCR errors
- Use code + an LLM to segment the PDF into manageable sections, where each section represents a single “record” to be used as a retrievable item
- Use an LLM to convert each section into an entity to be stored in a database, with the searchable fields exposed so you can leverage standard SQL queries
- Clone that entity (with some or all of its fields) into an object to be embedded
- Embed and save the vectorized object in a separate vector database, connected by ID to the regular database record
- For retrieval, use a combination of SQL and vector search to find the objects you need
- Then use some logic to combine the results from both searches and produce the final list (see the sketch below)
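Something like this is what I mean, as a minimal sketch of the dual-store idea: SQLite stands in for the structured side, a plain Python list stands in for the vector database, and `embed()` is a toy placeholder for whatever embedding model you use. Table and field names are made up for the example:

```python
import sqlite3

def embed(text: str) -> list[float]:
    # Toy stand-in (bag of letters); swap in your real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Structured store: one row per extracted "record" (section).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    doc_id TEXT,
    page INTEGER,
    record_date TEXT,  -- ISO 8601 so BETWEEN works as a date range
    body TEXT
)""")

# Vector store: same id, embedded body. In production this would be
# Weaviate / pgvector / etc.; a list keeps the sketch self-contained.
vector_index: list[tuple[int, list[float]]] = []

def ingest(doc_id: str, page: int, record_date: str, body: str) -> None:
    cur = db.execute(
        "INSERT INTO records (doc_id, page, record_date, body) VALUES (?, ?, ?, ?)",
        (doc_id, page, record_date, body),
    )
    vector_index.append((cur.lastrowid, embed(body)))

def search(query: str, date_from: str, date_to: str, k: int = 5):
    # SQL narrows the candidate set by date...
    rows = db.execute(
        "SELECT id, body FROM records WHERE record_date BETWEEN ? AND ?",
        (date_from, date_to),
    ).fetchall()
    allowed = dict(rows)
    # ...and the vector score only ranks within that set.
    qvec = embed(query)
    scored = sorted(
        ((cosine(qvec, vec), rid) for rid, vec in vector_index if rid in allowed),
        reverse=True,
    )
    return [(rid, score, allowed[rid]) for score, rid in scored[:k]]
```

The important bit is the shared `id`: the date logic lives entirely in SQL, so near-duplicate pages from other dates never compete with each other in the vector search.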
I would definitely use Weaviate for the vector-side management, and its indexing might also let you filter by dates if needed; something like the query below.
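For example, with the v4 Python client it could look roughly like this (the `Record` collection and `record_date` property are made up for the example, and `near_text` assumes a vectorizer module is configured on the collection):

```python
from datetime import datetime, timezone

import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()
try:
    records = client.collections.get("Record")
    result = records.query.near_text(
        query="renewal terms",
        limit=5,
        # DATE properties filter on timezone-aware datetimes.
        filters=Filter.by_property("record_date").greater_or_equal(
            datetime(2023, 1, 1, tzinfo=timezone.utc)
        ),
    )
    for obj in result.objects:
        print(obj.properties["record_date"], str(obj.properties["body"])[:80])
finally:
    client.close()
```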
Thanks for the reply. This is an interesting approach. The only problem is that we have to do this thousands and thousands of times across many, many customers, so using an LLM for this step would probably get too expensive.
Yes, and the only thing I would add is that each document has a lot of pages, and there are thousands of them per user/customer. Also, we don’t “know” the date itself, so it’s not something we can hard-code, if that makes sense.
So, use code and regex whenever possible…
Got it. So as @sergeliatko suggested, it’s about applying regex to the date format (worst case, there are different formats, but that is still completely doable with a few regex rules).
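For example, a handful of patterns covers the common cases; this sketch handles three formats, and you would extend the list to whatever actually appears in your documents:

```python
import re

DATE_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",                # 2023-07-04
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",          # 7/4/2023 or 04/07/23
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[a-z]*\.?\s+\d{1,2},?\s+\d{4}\b",      # July 4, 2023
]
DATE_RE = re.compile("|".join(DATE_PATTERNS), re.IGNORECASE)

def extract_dates(page_text: str) -> list[str]:
    """Return date-looking strings in order of appearance on the page."""
    return DATE_RE.findall(page_text)

print(extract_dates("Renewed on July 4, 2023; prior notice sent 2023-06-01."))
# -> ['July 4, 2023', '2023-06-01']
```

You would then normalize the hits (e.g. with `datetime.strptime`) into ISO 8601 before storing them, so range queries and sorting behave.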
What you want to do is build an associated metadata map (just a simple JSON object) for each chunk. The metadata will contain things like document ID, page number, section/sub-section (if appropriate), and the last date that was extracted. So for each embedded text chunk, you will also have this metadata associated with it.
This gives you a lot of power when doing RAG, since you can try to disambiguate the date, or even provide a response covering multiple dates.
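Concretely, the metadata map per chunk could be as simple as this (field names are illustrative, not prescriptive):

```python
# Stored alongside the embedding for one chunk.
chunk_metadata = {
    "doc_id": "contract-8841",
    "page": 37,
    "section": "3.2 Renewal Terms",
    "last_date_seen": "2023-07-04",  # last date extracted up to this chunk
}

# At query time, filter candidates on the metadata so near-duplicate
# chunks carrying other dates drop out of the result set:
def filter_by_date(candidates: list[dict], wanted_date: str) -> list[dict]:
    return [c for c in candidates if c["metadata"]["last_date_seen"] == wanted_date]
```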
I’ve been running my own private PDF-to-Markdown translator on a cloud-based A100.
I feed it 16-page PDFs no problem. I have no idea what the limit is, but if this is an issue, you could slice the PDF a page at a time, or into logical page chunks as a batch, roughly as sketched below. Keep track of the metadata of these pages and chunks and associate it with the Markdown coming back. (Personally, I save the raw Markdown, distill it into structured JSON, and save all of this in a DB.)
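If you go the slicing route, a sketch with pypdf might look like this; `pdf_to_markdown` and `save_to_db` are stand-ins for the translator model and storage layer, not real APIs:

```python
from io import BytesIO
from pypdf import PdfReader, PdfWriter

def iter_page_batches(path: str, batch_size: int = 16):
    """Yield ((first_page, last_page), pdf_bytes) batches that fit the model."""
    reader = PdfReader(path)
    for start in range(0, len(reader.pages), batch_size):
        writer = PdfWriter()
        for page in reader.pages[start:start + batch_size]:
            writer.add_page(page)
        buf = BytesIO()
        writer.write(buf)
        end = min(start + batch_size, len(reader.pages))
        yield (start, end), buf.getvalue()

def pdf_to_markdown(pdf_bytes: bytes) -> str:
    ...  # stand-in for your GPU translator

def save_to_db(doc_id: str, pages: tuple[int, int], md: str) -> None:
    ...  # stand-in for your storage layer

for pages, pdf_bytes in iter_page_batches("contract-8841.pdf"):
    save_to_db("contract-8841", pages, pdf_to_markdown(pdf_bytes))
```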
Then chunk and RAG away. The challenge is that your data is so similar that retrieval is likely to be bogus. You may have to hard-cut with a combination of code and some sharper-than-semantic algorithms.
Let me give you an example. I tried address matching using embedding vectors, and it was a disaster. I had better success matching on street number and ZIP in a DB, then using difflib (a Python builtin) to resolve any ties.
Semantics (embeddings) can be very, very fuzzy.
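To make that concrete, here is a minimal sketch of the tie-breaking step, assuming the hard match on street number and ZIP has already narrowed things to a few candidate rows:

```python
import difflib

def resolve_tie(target: str, candidates: list[str]) -> str:
    """Pick the candidate address string closest to the target."""
    scored = [
        (difflib.SequenceMatcher(None, target.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    return max(scored)[1]

print(resolve_tie(
    "123 N Main St Apt 4",
    ["123 N Main Street Apt 4", "123 N Maintenance Rd"],
))
# -> 123 N Main Street Apt 4
```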