Ask questions over a large dataset (not only for the search use case)

I’m looking to build an AI assistant that can access a collection of documents (about 200 MB of text and HTML files) and answer specific queries about them, such as:

  • How many documents are indexed?
  • What is the most frequently mentioned topic across the documents?
  • Can you identify duplicated text across all documents?
  • What is the most used keyword across all the documents?

Can you suggest best practices for setting this up? I’ve tried embeddings and RAG with OpenAI, but for the kinds of questions above, the LLM would need access to all the documents to generate an answer, not just the handful of chunks that retrieval returns.
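For concreteness, here is a minimal sketch of the kind of corpus-wide statistics I mean, computed directly with the Python standard library (the `docs/` path, the stopword list, and the crude HTML tag stripping are just illustrative assumptions):

```python
import hashlib
import re
from collections import Counter
from pathlib import Path

DOCS_DIR = Path("docs")  # hypothetical location of the corpus
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}

def extract_text(path: Path) -> str:
    """Read a file and crudely strip HTML tags (good enough for counting)."""
    raw = path.read_text(encoding="utf-8", errors="ignore")
    return re.sub(r"<[^>]+>", " ", raw)

files = [p for p in DOCS_DIR.rglob("*") if p.suffix.lower() in {".txt", ".html", ".htm"}]
print(f"Documents indexed: {len(files)}")

word_counts: Counter = Counter()
para_hashes: dict = {}

for path in files:
    text = extract_text(path)
    words = re.findall(r"[a-z]{3,}", text.lower())
    word_counts.update(w for w in words if w not in STOPWORDS)
    # Hash normalized paragraphs to spot text duplicated across documents.
    for para in re.split(r"\n\s*\n", text):
        norm = " ".join(para.split()).lower()
        if len(norm) > 100:  # skip short boilerplate paragraphs
            digest = hashlib.sha256(norm.encode()).hexdigest()
            para_hashes.setdefault(digest, []).append(str(path))

print("Most used keywords:", word_counts.most_common(5))
duplicates = {h: ps for h, ps in para_hashes.items() if len(set(ps)) > 1}
print(f"Paragraphs duplicated across documents: {len(duplicates)}")
```

My thought is that results like these could be precomputed and handed to the model (e.g., via function calling) instead of putting the whole corpus in context, but I’m not sure whether that is the recommended practice.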

Thanks!

Your goals are similar to mine. I need it to simply fetch valid hyperlinks from today’s internet, and it can’t. And then there’s the token cost of dropping 30 MB into the context window. Maybe use an open-source model and fine-tune it on your documents? GPT may never be allowed to do what you’re expecting.
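As a rough back-of-the-envelope (the ~4 characters per token figure is a common approximation for English and varies by tokenizer):

```python
corpus_bytes = 30 * 1024 * 1024   # ~30 MB of plain text
chars_per_token = 4               # rough average; depends on the tokenizer
approx_tokens = corpus_bytes // chars_per_token
print(f"~{approx_tokens:,} tokens")  # ~7,864,320 tokens, far beyond typical context windows
```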