Problem with doing RAG with 300k pages of PDFs

Hi everyone, I am developing a project where an AI will have to answer questions based on 300k pages of legal PDFs. The questions users ask will not vary much. What I wanted to ask is: how do I build this? I have already used LlamaIndex and LangChain for smaller projects, around 10k pages of PDFs, but I have never tried anything at this scale (300k). Are there new or more advanced RAG techniques I will need to use? Will I need metadata, hybrid search, or other things I am not aware of?
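To make the metadata and hybrid-search ideas concrete, here is a minimal, framework-free sketch: a metadata pre-filter followed by a blend of keyword and vector scores. All function names, field names, and data here are illustrative assumptions, not LlamaIndex or LangChain APIs, and the keyword score is a crude stand-in for a real BM25 implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    # Crude stand-in for BM25: fraction of query terms present in the chunk.
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_search(query, query_vec, chunks, metadata_filter=None, alpha=0.5, k=3):
    # chunks: list of dicts like {"text": ..., "vec": ..., "meta": {...}}.
    # Metadata pre-filter first, so we only score chunks that can match.
    candidates = [
        c for c in chunks
        if metadata_filter is None
        or all(c["meta"].get(key) == val for key, val in metadata_filter.items())
    ]
    # Blend vector and keyword scores; alpha weights the vector side.
    scored = [
        (alpha * cosine(query_vec, c["vec"])
         + (1 - alpha) * keyword_score(query, c["text"]), c)
        for c in candidates
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]
```

At 300k pages you would of course replace the linear scan with a proper vector index and a real BM25 engine, but the filter-then-blend shape stays the same.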

Thanks for taking the time to read this.


Welcome to the forum!

Check out posts by @stevenic … lots of useful RAG info, and his new startup is doing a lot of work in the area…

Good luck!


Hello!
I'd suggest checking out my latest post, since I am facing a similar issue: RAG over almost 100k PDFs.

I tried it with 60k PDFs, but a single query took 10 minutes to return an answer! Too slow :frowning:

I am trying fine-tuning right now (I assume it is also simpler to deploy), but someone suggested using RAG with the PDFs converted to JSON, or making use of PDF metadata. It would be great to figure out how to make that work.
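For what "PDFs converted to JSON" might look like in practice, here is a hypothetical sketch: each extracted page becomes a JSON record carrying metadata, so retrieval can filter by document type or date range before any embedding comparison. All field names are illustrative assumptions, not a standard schema.

```python
import json

def pdf_to_records(doc_id, pages, doc_type, year):
    # pages: list of extracted page texts for one PDF.
    # Each page becomes one JSON-serializable record with filterable metadata.
    return [
        {
            "doc_id": doc_id,
            "page": i + 1,
            "doc_type": doc_type,
            "year": year,
            "text": text,
        }
        for i, text in enumerate(pages)
    ]

def filter_records(records, doc_type=None, year_from=None):
    # Cheap metadata pre-filter applied before any embedding search.
    out = []
    for r in records:
        if doc_type is not None and r["doc_type"] != doc_type:
            continue
        if year_from is not None and r["year"] < year_from:
            continue
        out.append(r)
    return out

def write_jsonl(records, path):
    # One JSON record per line, the usual format for bulk ingestion.
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

The point is that filtering on a couple of dict fields is far cheaper than scoring every chunk, which matters a lot at the 60k–300k document scale.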

I hope this helps you, at least a bit :slight_smile:

Thank you so much for these tips, love it.
Will follow the topic. Cheers

Thanks for answering!
First of all, are you using any framework for RAG, or are you building it from scratch? Also, which embedding model are you using, and did you fine-tune it? I am going to try fine-tuning a Hugging Face embedding model to see if I can improve accuracy. For my use case, I am not sure the metadata will be useful.

Consider GPTs’ Knowledge feature rather than building RAG yourself. Files uploaded as Knowledge are retrieved with roughly the same efficiency as a custom RAG pipeline. Everything communicated to a GPT is a prompt and can be used in the same way, as long as it stays within the system’s limitations.

You are welcome!
Honestly, I followed several tutorials on the Internet and I modified them for my purposes.
For the embedding model, I tried “sentence-transformers/all-mpnet-base-v2”, “sentence-transformers/all-MiniLM-L6-v2”, “BAAI/bge-large-en-v1.5”, and “BAAI/bge-base-en-v1.5” (all available on the Hugging Face Hub). Unfortunately, changing the embedding model did not change anything, even though “BAAI/bge-large-en-v1.5” and “BAAI/bge-base-en-v1.5” are considered among the best embedding models.
I have never fine-tuned an embedding model. How do you plan to do it: with PDFs, JSONs, or a file of QA pairs? I am not confident it will speed up the model’s replies much, but who knows, it could be a wonderful surprise!
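On the QA-pair option: embedding models are commonly fine-tuned on (question, relevant passage) pairs with a contrastive loss (e.g. MultipleNegativesRankingLoss in sentence-transformers). Here is a hedged sketch of just the data-preparation step, writing pairs to JSONL; the training loop itself is omitted, and the pair contents and field names are purely illustrative.

```python
import json

def build_pairs(qa_items):
    # qa_items: list of (question, relevant_passage) tuples.
    # Each becomes one training example; in-batch negatives are typically
    # supplied automatically by a contrastive loss during training.
    return [{"query": q, "positive": p} for q, p in qa_items]

def write_pairs_jsonl(pairs, path):
    # One pair per line, a common input format for fine-tuning scripts.
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Note that fine-tuning improves retrieval *accuracy*, not latency; the 10-minute replies are more likely an indexing or pipeline issue than an embedding-quality one.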

Please let me know if you succeed in fine-tuning the model! In the meantime, I will keep researching.

Sorry for the delay; I’ve been mostly offline the last week or so. I do have some techniques and algorithms I’ve developed that let me perform RAG over a document corpus of any size (potentially millions of docs), but I’m actively building a company around that work, so I’m not sure how many details I can share. If you’d like to potentially license the engine I’m building, I’d be happy to chat. Just DM me.


GPT4 Tutorial: How to chat with multiple pdf files - The Chat Completion Process (R.A.G. / Embeddings)
