Scaling RAG chatbot system to millions of documents

Currently I have a working chatbot which I can use to extract information from ~2000 documents related to the financial sector. However, I have some questions related to the scaling of this project, because the idea would be to have hundreds of thousands, or possibly millions, of these documents.
My first question is how to handle the limited context. The prompt I'm using has a context and a history variable, and these grow large very quickly, so if I want to work with this many documents, how could this possibly work?
One idea I had was to summarize the text before including it in the context. I have also been looking at Sparse Priming Representation, but I still can't understand what it is, apart from an apparently well-formulated prompt.


This is a big project, and in my opinion, being related to the financial sector, it will be more difficult than most.

The good news is that this would be a huge achievement!

I’d say the first step is to separate these concerns.

Then, focusing on a retrieval & confirmation phase would be my next step: a relatively small, preferably locally hosted model could be used to parse the returned context and determine when the information is sufficient to respond. You would probably want to fine-tune it as well, so it understands the terminology, and expose it to some representative documents. You could pass this information on to your conversational model (GPT), validate it once more, and then respond.
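A minimal sketch of that retrieve-and-confirm loop. The retriever and the sufficiency check here are toy stand-ins; in practice the validator would be your small local model and the final answer would come from the conversational model:

```python
# Two-stage retrieval: widen the search until a validator says the
# context is sufficient, then hand off to the answering model.

def retrieve(query, k):
    """Stand-in retriever: return the top-k chunks for a query."""
    corpus = {
        "q3 revenue": ["Q3 revenue rose 12% year over year.",
                       "Operating margin held at 21% in Q3."],
    }
    return corpus.get(query, [])[:k]

def is_sufficient(query, chunks):
    """Stand-in for a small validator model that decides whether the
    retrieved context can answer the query."""
    return len(chunks) >= 2  # placeholder heuristic

def answer(query, max_k=8):
    k = 2
    chunks = retrieve(query, k)
    # Widen the retrieval until the validator is satisfied or we give up.
    while not is_sufficient(query, chunks) and k < max_k:
        k *= 2
        chunks = retrieve(query, k)
    if not is_sufficient(query, chunks):
        return None  # escalate or refuse rather than hallucinate
    return "Answer based on: " + " ".join(chunks)

print(answer("q3 revenue"))
```

The point of the loop is that the expensive conversational model only ever sees context that a cheap model has already vetted.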

Never seen this before, and it does look like that: an apparently well-formulated prompt. I could see this being beneficial, but I don't think it's a solution on its own.

Realistically it’s going to be a massive undertaking to embed all of these documents multiple times using different formulas & schemas. A lot of the time and resources will be spent on refinement. You’re going to need to find some way to ensure that all these documents are separated enough for the model, which may require your own specialized/fine-tuned embedding model ( & 100% will require hybrid search for keywords ).

There are some groups that are creating specialized embedding models. One of them being Voyage, who has an embedding model ranked #2 on MTEB (only 4,000 tokens and 1,024 dimensions though). I think an undertaking like yours would be interesting enough for them to grant you trial access to their finance model.

More advanced and specialized models are coming soon and please email for free trial access.

  • voyage-finance-2: coming soon
  • voyage-law-2: coming soon
  • voyage-multilingual-2: coming soon
  • voyage-healthcare-2: coming soon

You could also break this up into categories and first run a similarity test on the categories to narrow the possibilities. Run it through multiple levels of granularity.


When you say "break up into categories", could this be done by, for example, classifying all the documents before uploading their embeddings to the database, and then assigning a value to their metadata with the corresponding category?

Yes, and then you could first run the model to classify the query & filter the pool of candidate documents for the next round. Not necessary, but it may help if there's a lot of overlap that the model can't separate.
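A toy sketch of that classify-then-filter flow. The categories, vectors, and keyword classifier are all invented for illustration; a real system would use an actual embedding model and a trained classifier:

```python
import math

# Classify the query into a category first, then restrict the vector
# search to documents tagged with that category in their metadata.

docs = [
    {"id": 1, "category": "banking",   "vec": [0.9, 0.1]},
    {"id": 2, "category": "insurance", "vec": [0.8, 0.2]},
    {"id": 3, "category": "banking",   "vec": [0.2, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def classify(query):
    """Stand-in query classifier; could be a small model or rules."""
    return "banking" if "loan" in query else "insurance"

def search(query, query_vec, k=2):
    cat = classify(query)
    pool = [d for d in docs if d["category"] == cat]  # metadata pre-filter
    pool.sort(key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return [d["id"] for d in pool[:k]]

print(search("loan rates", [1.0, 0.0]))  # only banking docs are ranked
```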

These are just my suggestions and honestly there’s a lot of ways & angles. Hope you can find something helpful out of it and more people can comment with their expertise.

I’ve been thinking about this issue myself as I’m currently working with about 25,000 documents and looking to bid on a contract to process many times that amount.

I think the key is what you and @RonaldGRuckus touched upon: metadata and filters.

Not knowing anything about your documents, or what you hope to retrieve from them, I can’t say whether summarization is a good or bad idea or will help.

What I can say is that adding as much categorization and classification to your documents through metadata as you can will assist you tremendously in your quest to consistently find the right needle in your humongous haystack.

As for the context window, I don’t think it matters if you have 2,000 documents or 200,000 documents – you want to narrow your searches to bring back the few chunks from the few most relevant documents. Unless your questions typically require retrieving dozens or hundreds of documents, you should be able to work within a 100K token context window, even with a 4K token output limit.
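As a rough back-of-the-envelope budget (the window size, output reservation, prompt overhead, and chunk size below are all assumptions, not fixed numbers):

```python
# Context-window budgeting: how many retrieved chunks fit alongside
# the prompt and a reserved output allowance.
window = 100_000          # assumed total context window (tokens)
output_reserve = 4_000    # assumed output limit to reserve
prompt_overhead = 1_000   # assumed system prompt / instructions / history
chunk_tokens = 512        # assumed chunk size

budget = window - output_reserve - prompt_overhead
max_chunks = budget // chunk_tokens
print(max_chunks)  # 185 chunks fit, far more than most answers need
```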

And don’t be above using keywords as filters, where appropriate. Trust me, even in semantic search, they help cut down on a lot of noise.
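A minimal illustration of a keyword gate before semantic ranking. The chunks and vectors here are contrived so the two chunks embed identically and only the keyword separates them, which is exactly the noise-cutting effect described:

```python
import math

# Use a required keyword as a hard filter before semantic scoring,
# so unrelated-but-similar chunks never enter the ranking at all.
chunks = [
    {"text": "motorcycle policy deductible is $500", "vec": [0.7, 0.3]},
    {"text": "home policy deductible is $1000",      "vec": [0.7, 0.3]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query_vec, keyword):
    pool = [c for c in chunks if keyword in c["text"]]  # keyword gate
    pool.sort(key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return [c["text"] for c in pool]

# Both chunks are semantically near-identical; the keyword decides.
print(search([1.0, 0.0], "motorcycle"))
```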


I don’t understand how this would help when you’re already doing an embedding match. Any clear top-level category should already be more than well represented in the embedding vector, or the embedder would be rubbish.

Can you share more about what's mistaken in this assumption?

It's an option if these types of filters help separate the distances between documents that overlap.

Are you adding that metadata to the embedding itself, or just keeping it as separate metadata in the database? Because in that case you would need to process the query to extract the metadata values (so you could then filter the search), right?

Also, to be able to take advantage of the increased categorization and classification, the queries would need to be much more specific, right?


I use Weaviate. They use class objects with user-defined properties which I am referring to as “metadata”. You can optionally embed these properties and their titles. So in my use case, the metadata (title, group, taxonomy, summary, questions this chunk answers, etc…) are also embedded.

It depends. Example: I have a dataset of labor agreements from a variety of unions. Each union is categorized into a group (in addition to the name in the title), and further categorized into groups of groups. For example, DGA may be one group, but IATSE a group of groups. All contracts are classified as current or archived (past). So, I have user-selectable options that will automatically filter the requests by the appropriate group(s) and classifications. In other words, simple checkboxes to indicate unions, agreement types, etc… that you want to search.

However, because this metadata is also embedded, I could simply include in the prompt “answer this question as it pertains to the current Union X agreement”. I could add additional classifications to the documents like: overtime, insurance, holiday pay, etc…

As mentioned before, there are a lot of ways to do this. This is just one way I’ve found that has worked for me in multiple use cases.

The point being that this metadata helps to narrow searches and pinpoint the data we are looking for within very huge datasets. You want every chunk to be as detailed as possible (of course, without overloading it).


Let’s say you are an insurance company and you’ve got tens of thousands of documents pertaining to your business which include emails, chats, and text messages as well as policy manuals, guide books, etc…

You sell car insurance, home insurance, apartment insurance, motorcycle insurance and a variety of other policies. You’ve got to imagine that all of this documentation is discussing details about various types of insurance. But, when you do a cosine similarity search, you aren’t bringing back entire documents – you’re bringing back chunks of documents.

So, what way do you have to tell what type of insurance is being discussed in a particular sentence in a particular chunk of any particular email, text, chat transcript, or any other large document?

Contextually, one sentence about one type of insurance will look the same as a sentence about another type of insurance. And the LLM will only know what it sees in the chunk you send it along with the question. And, if you are like a lot of people and your chunk title only refers to the full document from which it is chunked (and not the hierarchical section within the document to which it actually pertains) then your LLM, and you, are in big trouble.

So, metadata not only helps for categorization and subsequent filtering, it also helps the model in understanding the overall context of the documents you are sending it to analyze.
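One way to sketch that idea: prepend the hierarchical metadata to each chunk's text before embedding it, so both the vector and the LLM see which document, section, and insurance line the sentence belongs to. The field names here are invented for illustration:

```python
# Enrich each chunk with its hierarchical metadata before embedding,
# so a sentence about deductibles carries its line of business with it.

def enrich(chunk):
    header = (f"Document: {chunk['doc_title']} | "
              f"Section: {chunk['section']} | "
              f"Line of business: {chunk['category']}")
    return header + "\n" + chunk["text"]

chunk = {
    "doc_title": "Policy Manual 2024",
    "section": "Claims > Deductibles",
    "category": "motorcycle insurance",
    "text": "The deductible applies once per claim.",
}
print(enrich(chunk))
```

The enriched string is what you embed and what you send to the model, so an otherwise ambiguous sentence is never separated from its context.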


+1 for Weaviate.

Essential during the tinkering & development phase of RAG.


+1 for SuperDuperDB, as you can build a good pipeline very quickly.

The standard method of including the document title and the subheadings that lead to the particular embedded chunk tends to solve that just fine, though. At least for us 🙂

I guess as long as you have a good document taxonomy already, that serves the same function as an explicit tag set, and if you don’t, adding that would serve that function instead, so that makes sense!


Metadata & filters is and has been a necessity for any high-grade & heavy database. It could be that the documents start to share a very similar and condensed semantic space where distances aren’t enough for a confident answer (even with pre-processing) but for documents such as financial records (or in my case, datasheets) that follow a near-exact semantic schema it becomes critical to use metadata as filters.

Even without necessarily requiring these tags, if you know or want to search for a specific type of document, or search for a specific company you can use these metadata tags/filters to speed up the search, as seen in the article below.

To save myself more typing here’s some education:

So realistically even if you don’t need it, it still is beneficial to have (and then there’s also analytics)


Beyond search-strategy optimization, you should also consider what I call the Control Plane. This is Input X → Action Y.

The action may influence your search pattern.

For example, “All companies that did Z in the last year”.

So the "in the last year" would create a filter (either pre- or post-retrieval) that would time-gate your results.

The LLM itself doesn’t have reliable time gating, so it has to be done directly in the query.

However, even the “in the last year” may not be 100% reliable, since it would have to be inferred, so a higher SNR solution is to have things like this as user inputs. So the user would set a cutoff date explicitly. Either through a GUI, or through an explicit voice command / text input command, that is acknowledged back by the AI to the user.
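A minimal sketch of that explicit time gate, where the cutoff date comes from the user (e.g. a GUI date picker) rather than being inferred from free text:

```python
from datetime import date

# Time-gate retrieval results with an explicit, user-supplied cutoff.
docs = [
    {"id": "a", "published": date(2024, 3, 1)},
    {"id": "b", "published": date(2021, 6, 1)},
]

def time_gate(results, cutoff):
    """Keep only documents published on or after the cutoff date."""
    return [d for d in results if d["published"] >= cutoff]

cutoff = date(2024, 1, 1)  # set explicitly by the user, not inferred
print([d["id"] for d in time_gate(docs, cutoff)])  # only the recent doc
```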

Also, don’t forget hybrid search. So embeddings + keywords, or embeddings1 + embeddings2 + … + embeddingsN. Combine all the results back, and fuse into a single ranking.
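One common way to fuse the separate result lists back into a single ranking is reciprocal rank fusion. A minimal version (the constant 60 is the value conventionally used in the RRF literature):

```python
# Reciprocal rank fusion: each ranked list votes 1/(k + rank) for each
# document, and documents strong in multiple lists rise to the top.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d1", "d2", "d3"]   # embedding-based ranking
keyword  = ["d3", "d1", "d4"]   # keyword-based ranking
print(rrf([semantic, keyword]))
```

Note that d1 and d3, which appear high in both lists, outrank documents that only one retriever found.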

In addition, you can use the LLM to reframe the original user's input, essentially expanding on it, and then use that as the search. So, given "Input X", ask the AI to produce, for example, "Rephrase Input X from the perspective of a hedge fund manager". You can do this reframing from different perspectives, depending on your overall goal (if there is one) with the user.

It can get out of hand, but you could take the user's initial input, Q different AI-synthesized inputs, and multiple embeddings/keywords, all running in parallel, to form some interesting retrievals.

So the compute explodes quickly, and you need to go back to the drawing board, and do more and more pre filtering.

So you need a computationally “cheap” initial filter, followed by a more refined expensive filter. There are many of these, but that’s the general idea.

But I will leave you one to think about … using small vector embeddings as an upfront “cheap” pre-filter, followed by larger embedding vectors as a refinement. You can do this with the new OpenAI embedding models, as I have talked about over here.
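A sketch of that cheap-then-expensive pattern: score with truncated vectors first (as with the shortened dimensions the newer embedding models allow), then re-rank only the survivors with the full vectors. The vectors below are toy values:

```python
import math

# Coarse-to-fine retrieval: a cheap pre-filter on truncated vectors,
# then a refinement pass on full vectors over the survivors only.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = {
    "d1": [0.9, 0.1, 0.0, 0.4],
    "d2": [0.8, 0.2, 0.9, 0.1],
    "d3": [0.1, 0.9, 0.2, 0.3],
}

def coarse_to_fine(query_vec, prefilter_dims=2, keep=2):
    # Cheap pass: compare only the first few dimensions.
    q_small = query_vec[:prefilter_dims]
    coarse = sorted(docs,
                    key=lambda d: cosine(docs[d][:prefilter_dims], q_small),
                    reverse=True)[:keep]
    # Expensive pass: full vectors, but only over the shortlist.
    return sorted(coarse, key=lambda d: cosine(docs[d], query_vec),
                  reverse=True)

print(coarse_to_fine([1.0, 0.0, 1.0, 0.0]))
```

With millions of documents, the coarse pass does most of the pruning at a fraction of the distance-computation cost.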


LLMs in some ways are the worst.

They can create problems, but they can't solve them 🙄.
Unfortunately that was my selling point as a programmer.

But, the good news is that their ability to compute and format the problem is enough for computers to ingest and return a solution. I think this directly ties to the metadata/filter conversation.

Semantic embedding does enough, that’s great.

At a certain level logic is required. Being able to separate & process the logic using typical methods will be a timeless solution.

"In the last year" is a simple concept that we can all understand. An LLM can also understand the semantics of this request well enough to pass it along to be processed logically.

This really resonates with me (as someone who has done sales). The information we hear is much different from the information we consciously process, based on the presumptions we make about the person.

And create shortcuts! If a hedge fund manager started asking me the difference between calls and puts my mind would perform a hard reboot and take seconds to respond. Probably a lot of “uuhhh… but you’re a … uuhhh… well so it starts … wait… I need to sit down”

Which I think represents us as well. If bozo the clown said some deeply philosophical shit I probably wouldn’t notice. ENTROPY


Process it like a brain would. Consider introducing a new property into your workflow: scale. It's a game changer. No need for a super rocket-science solution. Use NLU combined with scale and you will see magic. Let me know if you need help. 🙂

@raibd Thanks for sharing your use case. I have a couple of questions.

  • How do you measure the accuracy of your answers generated by the Language Model (LM)?
  • Sometimes, LLMs might amalgamate information from multiple sources, leading to inaccuracies. For instance, if the desired answer pertains solely to document1, but the LM mixes chunks from document1 and document2 indiscriminately, it could produce an erroneous response.

Could you clarify how you address or mitigate such potential issues in your workflow?