Problem with doing RAG with 300k pages of PDFs

Hi everyone, I am developing a project where an AI will have to answer questions based on 300k pages of legal PDFs. The questions users ask will not vary much. What I wanted to ask is: how do I build this? I have already used LlamaIndex and LangChain for smaller projects, around 10k pages of PDFs, but I have never tried anything at this scale (300k). Are there new or more advanced RAG techniques I will need to use? Will I need metadata, hybrid search, or other things I am not aware of?
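To make the metadata and hybrid-search ideas concrete, here is a minimal, framework-free sketch: a metadata pre-filter followed by a blend of keyword and vector scores. All function names, field names, and data here are illustrative assumptions, not LlamaIndex or LangChain APIs, and the keyword score is a crude stand-in for a real BM25 implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    # Crude stand-in for BM25: fraction of query terms present in the chunk.
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_search(query, query_vec, chunks, metadata_filter=None, alpha=0.5, k=3):
    # chunks: list of dicts like {"text": ..., "vec": ..., "meta": {...}}.
    # Metadata pre-filter first, so we only score chunks that can match.
    candidates = [
        c for c in chunks
        if metadata_filter is None
        or all(c["meta"].get(key) == val for key, val in metadata_filter.items())
    ]
    # Blend vector and keyword scores; alpha weights the vector side.
    scored = [
        (alpha * cosine(query_vec, c["vec"])
         + (1 - alpha) * keyword_score(query, c["text"]), c)
        for c in candidates
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]
```

At 300k pages you would of course replace the linear scan with a proper vector index and a real BM25 engine, but the filter-then-blend shape stays the same.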

Thanks for taking the time to read this.


Welcome to the forum!

Check out posts by @stevenic … lots of useful RAG info, and his new startup is doing a lot of work in the area…

Good luck!


Hello!
I'd suggest checking out my latest post, since I am facing a similar issue: RAG over almost 100k PDFs.

I tried it with 60k PDFs, but a single query took 10 minutes to return an answer! Too slow :frowning:

I am trying fine-tuning right now (I assume it is also simpler to deploy), but someone suggested using RAG with the PDFs converted to JSON, or making use of PDF metadata. It would be great to figure out how to make that work.
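For what "PDFs converted to JSON" might look like in practice, here is a hypothetical sketch: each extracted page becomes a JSON record carrying metadata, so retrieval can filter by document type or date range before any embedding comparison. All field names are illustrative assumptions, not a standard schema.

```python
import json

def pdf_to_records(doc_id, pages, doc_type, year):
    # pages: list of extracted page texts for one PDF.
    # Each page becomes one JSON-serializable record with filterable metadata.
    return [
        {
            "doc_id": doc_id,
            "page": i + 1,
            "doc_type": doc_type,
            "year": year,
            "text": text,
        }
        for i, text in enumerate(pages)
    ]

def filter_records(records, doc_type=None, year_from=None):
    # Cheap metadata pre-filter applied before any embedding search.
    out = []
    for r in records:
        if doc_type is not None and r["doc_type"] != doc_type:
            continue
        if year_from is not None and r["year"] < year_from:
            continue
        out.append(r)
    return out

def write_jsonl(records, path):
    # One JSON record per line, the usual format for bulk ingestion.
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

The point is that filtering on a couple of dict fields is far cheaper than scoring every chunk, which matters a lot at the 60k–300k document scale.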

I hope this helps you, at least a bit :slight_smile:

Thank you so much for these tips, love it.
Will follow the topic. Cheers

Thanks for answering!
First of all, are you using any framework for RAG, or are you building it from scratch? Also, which embedding model are you using, and did you fine-tune it? I am going to try fine-tuning a Hugging Face embedding model to see if I can improve accuracy. For my use case, I am not sure the metadata will be useful.

Consider GPTs’ Knowledge feature rather than building RAG yourself. Files uploaded as Knowledge are retrieved with roughly the same efficiency as a custom RAG pipeline. Everything communicated to a GPT is a prompt and can be used in the same way, as long as it stays within the system’s limitations.

You are welcome!
Honestly, I followed several tutorials on the Internet and I modified them for my purposes.
For the embedding model, I tried “sentence-transformers/all-mpnet-base-v2”, “sentence-transformers/all-MiniLM-L6-v2”, “BAAI/bge-large-en-v1.5”, and “BAAI/bge-base-en-v1.5” (all available on the Hugging Face Hub). Unfortunately, changing the embedding model did not change anything, even though “BAAI/bge-large-en-v1.5” and “BAAI/bge-base-en-v1.5” are considered among the best embedding models.
I have never fine-tuned an embedding model. How do you plan to do it: with PDFs, JSONs, or a file of QA pairs? I am not confident it will speed up the model’s replies much, but who knows, it could be a wonderful surprise!
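On the QA-pair option: embedding models are commonly fine-tuned on (question, relevant passage) pairs with a contrastive loss (e.g. MultipleNegativesRankingLoss in sentence-transformers). Here is a hedged sketch of just the data-preparation step, writing pairs to JSONL; the training loop itself is omitted, and the pair contents and field names are purely illustrative.

```python
import json

def build_pairs(qa_items):
    # qa_items: list of (question, relevant_passage) tuples.
    # Each becomes one training example; in-batch negatives are typically
    # supplied automatically by a contrastive loss during training.
    return [{"query": q, "positive": p} for q, p in qa_items]

def write_pairs_jsonl(pairs, path):
    # One pair per line, a common input format for fine-tuning scripts.
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Note that fine-tuning improves retrieval *accuracy*, not latency; the 10-minute replies are more likely an indexing or pipeline issue than an embedding-quality one.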

Please let me know if you succeed in fine-tuning the model! In the meantime, I will keep researching.

Sorry for the delay; I’ve been mostly offline the last week or so. I do have some techniques and algorithms I’ve developed that let me perform RAG over a document corpus of any size (potentially millions of docs), but I’m actively building a company around that work, so I’m not sure how many details I can share. If you’d like to potentially license the engine I’m building, I’d be happy to chat. Just DM me.


GPT4 Tutorial: How to chat with multiple pdf files - The Chat Completion Process (R.A.G. / Embeddings)
