Hello, community.
I am trying to gather information about building an AI-powered question-answering chatbot. I have four datasets of different sizes: 300,000 documents, 350,000 documents, 1.9 million documents, and 850,000 documents. These are HTML report files, but they can be converted to plain text.

I am thinking of using a pre-trained model and either fine-tuning it or using RAG to (a) answer users' questions about the domain-specific data in these reports and (b) summarize a lengthy user-provided document. For summarization, I believe most models do not need further fine-tuning, so I was planning to use a model out of the box. For question answering, however, my research suggests that fine-tuning mainly changes the way a model speaks but does not help much with retrieving facts. Is this correct?

Can RAG be implemented on a dataset of 1.9 million files? At roughly 40 KB per file, that dataset alone would be around 76 GB. The model does not need to be trained, nor RAG implemented, on all four datasets combined; rather, there would be a separate instance of the model for each dataset.

Please let me know if RAG alone is the best option, or if there are other retrieval approaches that would fit this use case. The solution also has to be cost-effective, as we do not have a lot of manpower or computational resources.
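To make the scale question concrete, here is roughly the pipeline I have in mind: one embedding index per dataset, queried at answer time. This is only a minimal sketch assuming sentence-transformers and FAISS; the model name, chunk sizes, and the sample documents are illustrative placeholders, not settled choices.

```python
# Minimal per-dataset RAG sketch (assumes sentence-transformers + FAISS;
# model, chunk size, and sample docs are illustrative placeholders).
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly encoder

def chunk(text, size=1000, overlap=100):
    """Split a document into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Stand-ins for the converted HTML reports; in practice these would be
# streamed from disk, since 1.9 million files won't fit in memory at once.
docs = ["...text of report one...", "...text of report two..."]

chunks = []
for doc in docs:
    chunks.extend(chunk(doc))

# Embed and index. IndexFlatIP is exact search; at millions of chunks an
# approximate index (e.g., faiss.IndexIVFFlat or an HNSW index) would be needed.
emb = model.encode(chunks, batch_size=256, convert_to_numpy=True)
faiss.normalize_L2(emb)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# At question time: retrieve the top-k chunks and pass them to the LLM as context.
query = "What does the report say about X?"
q = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(q)
scores, ids = index.search(q, 5)
context = "\n\n".join(chunks[i] for i in ids[0])
# `context` then goes into the LLM prompt together with the user's question.
```

My worry is whether something like this scales to ~76 GB of text per dataset on modest hardware, which is why I am asking about alternatives.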