Ask questions about a PDF without storing it in a vector database

Hello Community!

I am trying to build a chatbot that has three objectives:
1) Answer questions about domain-specific data (financial reports). I am using RAG to implement this.
2) Summarize the documents (financial reports) that are already chunked and embedded in the vector database.
3) Upload a user-chosen file and ask questions that are answered from that file.

My questions:
I have done some research and understand how to approach the first objective. For the second objective, I am thinking of using a map-reduce technique, since a vector database will already be in place for the first objective. Is this the only way, or is there a better one?

For the third objective, I do not want to store the user-provided file in the vector database. I want the file to be used for just that conversation and discarded afterwards, without being stored anywhere. How can that be done? Is it just a matter of reading the file, extracting the text, and sending that text to the model with every query in the conversation, including follow-up queries? What if the file is large and takes up almost 90% of the context limit? In that case, once the context limit runs out, is truncating the previous messages for follow-up questions the only option? Also, what if the file is too large to fit in the context window at all?

If you are using Assistants, you can attach files at the Message level and then discard the Thread when the conversation is over.
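
A rough sketch of that flow with the openai Python SDK (beta Assistants endpoints; the file name and assistant id are placeholders, and exact method names may differ between SDK versions):

```python
from openai import OpenAI

client = OpenAI()

# Upload the user's file for this conversation only.
uploaded = client.files.create(file=open("user_report.pdf", "rb"), purpose="assistants")

# Create a throwaway thread and attach the file on the message itself.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What were the main revenue drivers in this report?",
    attachments=[{"file_id": uploaded.id, "tools": [{"type": "file_search"}]}],
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id="asst_...",  # placeholder: an assistant with the file_search tool enabled
)
answer = client.beta.threads.messages.list(thread_id=thread.id)
print(answer.data[0].content[0].text.value)

# Discard everything once the conversation is over -- nothing is kept around.
client.beta.threads.delete(thread.id)
client.files.delete(uploaded.id)
```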

I’m not sure what you mean by map-reducing a document to summarize it. You can run GPT on the documents to distill them.

If the file is too large, it will be chunked and embedded in a vector store. Otherwise, the complete content is inserted as context. If you run out of token length, the conversation is truncated, not the context (I believe; I may need to verify this).
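
If you manage the context yourself (e.g. with Chat Completions instead of Assistants), the truncation asked about above could look roughly like this sketch; token counting uses tiktoken, and the budget number is only illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages):
    # Rough count of message contents only (ignores per-message formatting overhead).
    return sum(len(enc.encode(m["content"])) for m in messages)

def truncate_history(messages, budget=100_000):
    """Keep the system prompt and the extracted document text (first two messages),
    then drop the oldest user/assistant turns until the rest fits the budget."""
    fixed, turns = messages[:2], messages[2:]
    while turns and count_tokens(fixed + turns) > budget:
        turns = turns[2:]  # drop the oldest question/answer pair
    return fixed + turns
```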

Thank you for your response! I was referring to the following strategy for summarization:

  • Map Reduce
    • Chunk the document, summarize each chunk, then summarize all the chunk summaries.

I am thinking of implementing this by storing the summary of each chunk along with the vector embeddings and the actual text in the vector database while creating the embeddings for my first objective. When a user requests the summary of a specific report, I would retrieve the summaries of all the chunks of that document (a document being a report in my case) and combine those individual chunk summaries into a final summary (a rough sketch of that reduce step is below).
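
Assuming the chunk summaries are already stored as metadata next to the embeddings, the reduce step could look roughly like this; fetch_chunk_summaries stands in for your Pinecone/pgvector metadata query, and the model name is just an example:

```python
from openai import OpenAI

client = OpenAI()

def reduce_summaries(chunk_summaries, model="gpt-4o-mini"):
    """Combine the per-chunk summaries (the 'map' output) into one report-level summary."""
    joined = "\n\n".join(f"- {s}" for s in chunk_summaries)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You summarize financial reports."},
            {"role": "user", "content": "Combine these section summaries into one coherent "
                                        f"summary of the whole report:\n\n{joined}"},
        ],
    )
    return response.choices[0].message.content

# chunk_summaries = fetch_chunk_summaries(report_id)  # placeholder: your vector-store metadata query
# print(reduce_summaries(chunk_summaries))
```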

The reports that I will be using are fairly long. They could potentially run up to 90 pages, and I am assuming this will not fit in the context length. I do not want to chunk and embed the user-provided file into the vector database, because the same report might already exist there and the user might upload it unaware of that. I don’t want to duplicate the entry, so is there a better way?

PS: Just to clarify, I will not be using OpenAI retrieval for RAG. I will be using a vector database, either Pinecone or PostgreSQL pgvector.

People literally do everything to avoid learning SQL :sweat_smile:

Maybe you can start by explaining what kind of data you want to extract.

Then create a data structure for that, preferably inside an RDBMS.

Then prompt for the values. You already have the keys in your data structure.

Then insert them and use SQL to get your KPIs.
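
A rough sketch of those steps, using an example table layout (the columns, model name, and prompt are only illustrative):

```python
import json
import sqlite3
from openai import OpenAI

client = OpenAI()

# 1) Data structure for the values we care about (example columns only).
conn = sqlite3.connect("reports.db")
conn.execute("""CREATE TABLE IF NOT EXISTS report_figures (
    report TEXT, fiscal_year INTEGER, revenue REAL, net_income REAL)""")

# 2) Prompt for the values -- the keys come straight from the table definition.
def extract_figures(report_text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": "Return JSON with keys report, fiscal_year, revenue, net_income "
                       "for this report:\n\n" + report_text,
        }],
    )
    return json.loads(response.choices[0].message.content)

# 3) Insert, then let SQL compute the KPIs instead of the model.
row = extract_figures("...report text...")
conn.execute("INSERT INTO report_figures VALUES (?, ?, ?, ?)",
             (row["report"], row["fiscal_year"], row["revenue"], row["net_income"]))
conn.commit()

kpis = conn.execute(
    "SELECT report, net_income / revenue AS net_margin FROM report_figures"
).fetchall()
```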

That’s actually an old dispute that reaches back to the 50s, but for me it is clear: you get better answers in SQL than you can ever get in English.

I think your RAG should embed SQL instead of the real data.

You ask for something in English, it answers with SQL, and you get the data back or an error to handle.
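
A minimal sketch of that loop, reusing the example table from above; the schema string is whatever you store or embed for retrieval, and there is deliberately no guard here against destructive statements:

```python
import sqlite3
from openai import OpenAI

client = OpenAI()
conn = sqlite3.connect("reports.db")

SCHEMA = "report_figures(report TEXT, fiscal_year INTEGER, revenue REAL, net_income REAL)"

def ask(question):
    # The model answers with SQL against the known schema, not with the data itself.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Schema: {SCHEMA}\nWrite a single SQLite SELECT statement "
                       f"(no prose, no code fences) that answers: {question}",
        }],
    )
    sql = response.choices[0].message.content.strip()
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as err:
        return f"SQL error to handle: {err}"

print(ask("Which fiscal year had the highest net margin?"))
```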

Which means you should separate data mining and data presentation.

For data mining/extraction I am using JSON: you can check it against a schema and use regular expressions (or also check against databases with names, locations, etc. in them).
Plus you can use information graphs.
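
For the schema check, a small example with the jsonschema package (the schema and the regular expression are only illustrative):

```python
import re
from jsonschema import validate, ValidationError

# Illustrative schema for one extracted record.
FIGURE_SCHEMA = {
    "type": "object",
    "properties": {
        "report": {"type": "string"},
        "fiscal_year": {"type": "integer"},
        "revenue": {"type": "number"},
    },
    "required": ["report", "fiscal_year", "revenue"],
}

def check_extraction(record):
    try:
        validate(instance=record, schema=FIGURE_SCHEMA)
    except ValidationError as err:
        return f"Rejected: {err.message}"
    # An extra regular-expression check on a free-text field, e.g. a "FY2023 ..." label.
    if not re.match(r"FY\d{4}\b", record["report"]):
        return "Rejected: report label does not match the expected pattern"
    return "OK"
```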

So, what are you trying to achieve?
What kind of information do you want to extract?