I’m looking to build an AI assistant that can access a collection of documents (about 200MB of text and HTML files) and answer specific queries about them, such as:
- How many documents are indexed?
- What is the most frequently mentioned topic across the documents?
- Can you identify duplicated text across all documents?
- What is the most used keyword across all the documents?
Can you suggest best practices for setting this up? I’ve tried using embeddings and RAG with OpenAI, but for the type of questions I would like to ask, the LLM needs access to all the documents to generate an answer, not only to the retrieved parts of them.
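To make the kind of answer I'm after concrete, here is a minimal sketch (Python, standard library only) of how these corpus-wide figures could be computed in one pass over every document, outside the LLM. The names `corpus_stats`, `html_to_text`, and `STOPWORDS` are hypothetical. "Topic" is approximated by raw keyword frequency, and "duplicated text" by exact paragraph matches after whitespace and case normalization. Both are simplifying assumptions, not real topic modeling or fuzzy matching.

```python
import hashlib
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: this also picks up <script>/<style> bodies; a real
        # pipeline would probably want to skip those.
        self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML string (paragraph breaks are not preserved)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Tiny illustrative stopword list (an assumption; a real one would be larger).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}


def corpus_stats(docs, top_n=5):
    """Compute whole-corpus statistics.

    docs: mapping of document name -> plain text.
    """
    keyword_counts = Counter()
    paragraph_locations = defaultdict(set)  # paragraph hash -> document names

    for name, text in docs.items():
        # Keyword frequency over lowercased word tokens, minus stopwords.
        words = re.findall(r"[a-z0-9']+", text.lower())
        keyword_counts.update(w for w in words if w not in STOPWORDS)

        # Exact-duplicate detection: hash each normalized paragraph.
        for paragraph in re.split(r"\n\s*\n", text):
            normalized = " ".join(paragraph.lower().split())
            if normalized:
                digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
                paragraph_locations[digest].add(name)

    # Each entry lists the documents that share one identical paragraph.
    duplicates = sorted(
        sorted(names) for names in paragraph_locations.values() if len(names) > 1
    )
    return {
        "document_count": len(docs),
        "top_keywords": keyword_counts.most_common(top_n),
        "duplicated_paragraphs": duplicates,
    }
```

A summary like the returned dictionary is small enough to pass to the LLM as context, which is roughly what I'd like the assistant to be able to do on its own for the questions above.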
Thanks!