RAG or fine-tuning for a domain-specific QA chatbot

Hello, community.

I am trying to gather information about building an AI-powered question-answering chatbot. I have four datasets of different sizes (300,000, 350,000, 1.9 million, and 850,000 documents; these are HTML report files but can be converted to text files). I am thinking of using a pre-trained model and either fine-tuning it or using RAG to answer users' questions about the domain-specific data in these reports, and also to summarize a lengthy user-provided document. For summarization, I believe most models do not need further fine-tuning, so I was planning to use a model out of the box. For question answering, however, my research suggests that fine-tuning mostly changes the way a model speaks but doesn't help much with retrieving facts. Is this correct?

Can RAG be implemented on a dataset of 1.9 million files? At roughly 40 KB per file, that dataset alone could be around 74 GB. The model doesn't have to be trained, nor RAG implemented, on all these datasets combined, but there should be a separate instance of the model for each dataset. Please let me know whether RAG alone is the best option, or whether there are other retrieval approaches I could use for this use case. (The solution also has to be cost-effective, as we do not have a lot of manpower or computational resources.)


You are right: you definitely want RAG for this. You can run a DB locally (your own local vector DB) and generate embeddings locally. I'd recommend searching YouTube for RAG + LangChain to see how (though you probably already have), because this is a really common need, and there are fairly complete end-to-end solutions out there you can use without having to write much of the code yourself. Good luck!
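To make that retrieve-then-answer flow concrete, here is a minimal, dependency-free sketch of what frameworks like LangChain automate for you. The hashed bag-of-words "embedding" and the `TinyVectorStore` are stand-ins for a real embedding model and a real vector DB, and all the names here are illustrative, not from any particular library:

```python
import hashlib
import math

def embed(text, dim=512):
    """Toy embedding: hashed bag-of-words, L2-normalised. A real system
    would call an embedding model here; this only shows the data flow."""
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".,!?")
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class TinyVectorStore:
    """In-memory stand-in for a real vector DB (pgvector, Chroma, ...)."""
    def __init__(self):
        self.rows = []  # (chunk_text, vector) pairs

    def add(self, text):
        self.rows.append((text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

# The RAG flow: index document chunks once, then for each user question
# retrieve the closest chunks and paste them into the LLM prompt.
store = TinyVectorStore()
for chunk in ["The 2023 report covers turbine maintenance.",
              "Quarterly revenue grew by 12 percent.",
              "Turbine blades were replaced in March."]:
    store.add(chunk)

context = store.search("turbine maintenance schedule")
prompt = "Answer using only this context:\n" + "\n".join(context)
# `prompt` would now be sent to whatever chat model you choose.
```

In production you would swap `embed()` for a real embedding model and `TinyVectorStore` for a proper vector database, but the index / retrieve / assemble-prompt shape stays exactly the same, which is why it scales: only the k retrieved chunks ever reach the model, no matter how large the corpus is.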


Thank you for your response @wclayf ! What I am confused about is whether RAG can work on 2 million documents (each anywhere from 2 to 90 pages). The implementations I have seen online were for at most 100K documents. Is it really the best solution for such a vast dataset?

With a massive dataset like that you will almost certainly want a local vector DB that you run on your own machines. I think the pgvector extension for Postgres is probably your best bet. I haven't used it yet, but Postgres is arguably the best free, open-source DB out there, and pgvector is its way of doing this stuff. If you search YouTube for "Python LangChain Vector database Postgres pgvector" you should find examples of people showing how to do it.
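For the pgvector route specifically, the moving parts look roughly like this. This is a sketch, not a tested setup: it assumes Postgres with the pgvector extension installed, a 384-dimensional embedding model, and a psycopg2-style connection passed in from outside; the table and column names are made up for illustration:

```python
# Schema for storing embedded report chunks. `vector(384)` must match
# the output dimension of whatever embedding model you pick.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS report_chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    body      text NOT NULL,
    embedding vector(384)
);
-- Approximate index so nearest-neighbour search stays fast at ~2M rows.
CREATE INDEX IF NOT EXISTS report_chunks_embedding_idx
    ON report_chunks USING ivfflat (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
TOP_K_QUERY = """
SELECT body
FROM report_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

def chunk_text(text, size=1000, overlap=100):
    """Split a document into overlapping character chunks before
    embedding, since a 90-page report is far too big to embed whole."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def nearest_chunks(conn, query_embedding, k=5):
    """Fetch the k nearest chunks; `conn` would come from e.g.
    psycopg2.connect(), and `query_embedding` is a pgvector literal
    such as '[0.12,0.05,...]'."""
    with conn.cursor() as cur:
        cur.execute(TOP_K_QUERY, (query_embedding, k))
        return [row[0] for row in cur.fetchall()]
```

One instance per dataset then just means one table (or one database) per dataset, all served by the same Postgres installation, which keeps the cost down compared with running separate managed vector stores.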

Thank you for explaining that! I will definitely consider your suggestion!