RAG or fine-tuning for a domain-specific QA chatbot

rohithar · June 27, 2024, 8:06pm

Hello, community.

I am trying to gather information about building an AI-powered question-answering chatbot. I have 4 datasets of different sizes (300,000 documents, 350,000 documents, 1.9 million documents, and 850,000 documents). These are HTML report files, but they can be converted to text files. I am thinking of using a pre-trained model and either fine-tuning it or using RAG to answer users’ questions about the domain-specific data in these reports, and also to summarize a lengthy document provided by the user.

For summarization, I believe most models do not need further fine-tuning, so I was thinking of using a model out of the box. For question answering, however, my research suggests that fine-tuning mostly changes the way a model speaks and doesn’t help much with retrieving facts. Is this correct?

Can RAG be implemented on a dataset of 1.9 million files? Each file could be about 40 KB, so the dataset could be around 74 GB. The model doesn’t have to be trained, and RAG doesn’t have to be implemented, on all these datasets combined; instead, there should be a separate instance of the model for each dataset. Please let me know whether RAG alone is the best option or whether there are other retrieval approaches I can use for this use case. The solution also has to be cost-effective, as we do not have a lot of manpower or computational resources.
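As a rough aid to the sizing question above, here is a back-of-envelope sketch in Python. The document count and average file size come from the post; the chunk size, the embedding dimension, and the function name `estimate_index_size` are assumptions made only for illustration, and real numbers depend on the embedding model and on how much of each HTML file is actual text.

```python
import math

def estimate_index_size(num_docs, avg_doc_kb, chunk_chars=2000, embed_dim=384):
    """Rough sizing for one dataset.

    chunk_chars and embed_dim are assumed values, not taken from the thread.
    One byte is treated as one character, which overestimates the text
    content of HTML files because markup is counted too.
    """
    raw_bytes = num_docs * avg_doc_kb * 1000
    chunks_per_doc = math.ceil(avg_doc_kb * 1000 / chunk_chars)
    total_chunks = num_docs * chunks_per_doc
    vector_bytes = total_chunks * embed_dim * 4  # float32 = 4 bytes per value
    return {
        "raw_gb": raw_bytes / 1e9,
        "chunks_per_doc": chunks_per_doc,
        "total_chunks": total_chunks,
        "vector_gb": vector_bytes / 1e9,
    }

# Largest dataset from the post: 1.9 million files of about 40 KB each.
estimate = estimate_index_size(num_docs=1_900_000, avg_doc_kb=40)
for name, value in estimate.items():
    print(f"{name}: {value}")
```

Under these assumptions the largest dataset is about 76 GB of raw files (in line with the rough figure in the post), about 38 million chunks, and about 58 GB of float32 vectors before any index overhead. That is large, but it is storage that fits on a single well-provisioned machine.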

wclayf · June 27, 2024, 9:26pm

You are right, you definitely need to use RAG. You can run your own vector DB locally and generate the embeddings locally too. I’d recommend looking up RAG + LangChain on YouTube to see how (you probably already did). This is a really common need, so there are fairly complete end-to-end solutions out there that you can use without having to write much of the code yourself. Good luck!
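To make the retrieve-then-answer flow concrete, here is a minimal sketch that uses only the Python standard library. It is not the LangChain API: `embed` is a toy bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory replacement for a vector DB, and the three sample passages are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store; a real setup would use a vector DB."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Put the retrieved passages in front of the question for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Invented sample passages standing in for chunks of the report files.
store = ToyVectorStore()
store.add("Pump P-101 was inspected in March and showed bearing wear.")
store.add("The quarterly report lists revenue by region.")
store.add("Valve V-22 failed a pressure test and was replaced.")

question = "What was found when pump P-101 was inspected?"
top_passages = store.search(question, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)
```

The resulting prompt would then be sent to whichever pre-trained model is chosen. The model itself is never retrained, which is why this approach suits fact lookup better than fine-tuning does.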

rohithar · July 2, 2024, 10:37pm

Thank you for your response, @wclayf! The thing I am confused about is whether RAG can work on 2 million documents (each document could be anywhere from 2 to 90 pages). The implementations I saw on the internet were for at most 100K documents. Is it really the best solution for such a vast dataset?

wclayf · July 3, 2024, 1:21am

With a massive dataset like that, you will almost certainly want a local vector DB that you run on your own machines. I think the pgvector extension for Postgres is probably your best bet. I haven’t used it yet, but Postgres is really the best free, open-source DB out there, and pgvector is its way of doing this stuff. I bet if you search YouTube for “Python LangChain Vector database Postgres pgvector” you can find examples of people showing how to do it.
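For orientation, here is a sketch of what a pgvector setup could look like, written as Python helpers that only build SQL strings. Nothing in it comes from the thread: the table names, the column layout, the 384-dimension embedding size, and the choice of an HNSW index with cosine distance are all assumptions, and the statements are not executed here. Using one table per dataset is one possible way to keep the four datasets separate.

```python
EMBED_DIM = 384  # assumed embedding size; must match the embedding model used

def _check_table(table):
    # Table names are interpolated into SQL, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    return table

def schema_sql(table, dim=EMBED_DIM):
    """SQL statements to create one chunk table with a vector index."""
    table = _check_table(table)
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id bigserial PRIMARY KEY, "
        "doc_id text NOT NULL, "
        "content text NOT NULL, "
        f"embedding vector({int(dim)}) NOT NULL);",
        # Approximate nearest-neighbour index using cosine distance.
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_idx "
        f"ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]

def search_sql(table, k=5):
    """Query for the k nearest chunks; <=> is pgvector's cosine distance."""
    table = _check_table(table)
    return (
        f"SELECT doc_id, content FROM {table} "
        f"ORDER BY embedding <=> %s::vector LIMIT {int(k)};"
    )

def to_vector_literal(values):
    """Format a list of numbers in pgvector's text form, e.g. [0.1,2.0]."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Hypothetical table name for one of the four datasets.
for statement in schema_sql("reports_a_chunks"):
    print(statement)
print(search_sql("reports_a_chunks", k=5))
```

With a Postgres driver such as psycopg, the search statement would be run with the query embedding passed as the single parameter, for example `cur.execute(search_sql("reports_a_chunks"), (to_vector_literal(query_embedding),))`, where `query_embedding` comes from the same model that embedded the chunks.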

rohithar · July 3, 2024, 7:18pm

Thank you for explaining that! I will definitely consider your suggestion!
