Ask questions over a large dataset (not only for the search use case)

I’m looking to build an AI assistant that can access a collection of documents (about 200 MB of text and HTML files) and answer specific queries about them, such as:

  • How many documents are indexed?
  • What is the most frequently mentioned topic across the documents?
  • Can you identify duplicated text across all documents?
  • What is the most used keyword across all the documents?

Can you suggest best practices for setting this up? I’ve tried embeddings and RAG with OpenAI, but for the kinds of questions above, the LLM would need access to all the documents to generate an answer, not just the handful of chunks that retrieval returns.
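For concreteness, here is a minimal sketch of the kind of corpus-wide statistics I mean, computed directly with the Python standard library (the `docs/` path, the stopword list, and the crude HTML tag stripping are just illustrative assumptions):

```python
import hashlib
import re
from collections import Counter
from pathlib import Path

DOCS_DIR = Path("docs")  # hypothetical location of the corpus
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}

def extract_text(path: Path) -> str:
    """Read a file and crudely strip HTML tags (good enough for counting)."""
    raw = path.read_text(encoding="utf-8", errors="ignore")
    return re.sub(r"<[^>]+>", " ", raw)

files = [p for p in DOCS_DIR.rglob("*") if p.suffix.lower() in {".txt", ".html", ".htm"}]
print(f"Documents indexed: {len(files)}")

word_counts: Counter = Counter()
para_hashes: dict = {}

for path in files:
    text = extract_text(path)
    words = re.findall(r"[a-z]{3,}", text.lower())
    word_counts.update(w for w in words if w not in STOPWORDS)
    # Hash normalized paragraphs to spot text duplicated across documents.
    for para in re.split(r"\n\s*\n", text):
        norm = " ".join(para.split()).lower()
        if len(norm) > 100:  # skip short boilerplate paragraphs
            digest = hashlib.sha256(norm.encode()).hexdigest()
            para_hashes.setdefault(digest, []).append(str(path))

print("Most used keywords:", word_counts.most_common(5))
duplicates = {h: ps for h, ps in para_hashes.items() if len(set(ps)) > 1}
print(f"Paragraphs duplicated across documents: {len(duplicates)}")
```

My thought is that results like these could be precomputed and handed to the model (e.g., via function calling) instead of putting the whole corpus in context, but I’m not sure whether that is the recommended practice.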

Thanks!

Your goals are similar to mine. I need it to simply fetch valid hyperlinks from today’s internet, and it can’t. And then there’s the token cost of dropping 30 MB into the context window. Maybe use an open-source model and fine-tune it on your documents? GPT may never be allowed to do what you’re expecting.
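As a rough back-of-the-envelope (the ~4 characters per token figure is a common approximation for English and varies by tokenizer):

```python
corpus_bytes = 30 * 1024 * 1024   # ~30 MB of plain text
chars_per_token = 4               # rough average; depends on the tokenizer
approx_tokens = corpus_bytes // chars_per_token
print(f"~{approx_tokens:,} tokens")  # ~7,864,320 tokens, far beyond typical context windows
```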