RAG is failing when the number of documents increases

I wanted to check if others are also facing the same issue.

When we have a small number of documents, the embedding search fetches n docs based on a threshold, and then we take the top-n out of them, which contain the potential answer to the question. When the number of documents increases, the chunk with the answer gets pushed down to, let’s say, position n+k, and when we take the top-n chunks, that chunk is left out. I can increase the top-n, but that is not a sustainable solution; it will fail somewhere else.

Any ideas on how to solve this type of scenario?


One way is to maintain high quality embeddings.

Do you have a broad corpus spanning a variety of topics? Consider maybe splitting them up, and apply a stratified embedding search.

Does your corpus include documents in multiple languages? Consider harmonizing language and style before storing your embeddings (for recall, it may not be necessary to harmonize your query, but YMMV).

I think the best thing you can do is look at your highly ranked documents and ask yourself how and why they don’t answer the query - and then adjust your approach around that. This is an iterative thing that will likely need to happen at regular intervals as your corpus grows.


Can you please elaborate on the “stratified embedding search”? Is this a retrieval technique available as part of any of the packages?

You may want to consider adding a re-ranking model—such as the one from Cohere—into your RAG pipeline.

According to synthetic benchmarks it drastically improves the hit-rate of retrievals.

Another approach might be to use another set of embeddings produced with a different embedding model in tandem with text-embedding-ada-002, then merge the results with mean reciprocal ranking.
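As a rough illustration of the merging step, here is a minimal sketch that averages reciprocal ranks across several ranked lists. The function name, the chunk ids, and the optional smoothing constant `k` are assumptions made for the example (a non-zero `k`, often 60, is what the closely related reciprocal rank fusion method uses); nothing here is tied to a particular library.

```python
from collections import defaultdict

def merge_by_reciprocal_rank(rankings, k=0):
    """Merge ranked lists of chunk ids by averaging reciprocal ranks.

    rankings: list of lists, each ordered best-first (one list per embedding model).
    k: optional smoothing constant added to every rank (assumption; 0 = plain 1/rank).
    A chunk that is missing from a list contributes 0 for that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for position, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + position)
    n = len(rankings)
    # Highest mean reciprocal rank first; ties broken by id so output is deterministic.
    return sorted(((cid, total / n) for cid, total in scores.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Hypothetical result lists from two different embedding models.
ada_ranking = ["c3", "c1", "c7"]
other_ranking = ["c1", "c9", "c3"]
merged = merge_by_reciprocal_rank([ada_ranking, other_ranking])
print([cid for cid, _ in merged])  # ['c1', 'c3', 'c9', 'c7']
```

A chunk that both models rank reasonably well ("c1") ends up ahead of one that only a single model ranks first, which is the point of merging.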

Next, if you’re not already, you might consider using hybrid-search where you first filter the possible documents by keyword, date, or some other parameter to ensure the semantic search is only considering documents with a high likelihood of being relevant.
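A minimal sketch of that filter-then-search idea, assuming each chunk is a dict with hypothetical `id`, `text`, and `vector` keys and that a plain substring match is good enough as the keyword filter. A real system would more likely push the filter down into the vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(query_vec, required_keywords, chunks, top_n=3):
    """Keep only chunks containing at least one keyword, then rank by cosine similarity."""
    wanted = {kw.lower() for kw in required_keywords}
    candidates = [c for c in chunks
                  if any(kw in c["text"].lower() for kw in wanted)]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c["vector"]),
                    reverse=True)
    return [c["id"] for c in ranked[:top_n]]

# Toy data: chunk "b" is the closest vector but does not mention the keyword.
chunks = [
    {"id": "a", "text": "Overtime pay rules for union staff", "vector": [0.9, 0.1]},
    {"id": "b", "text": "Holiday schedule for 2023", "vector": [1.0, 0.0]},
    {"id": "c", "text": "Overtime approval workflow", "vector": [0.2, 0.8]},
]
hits = hybrid_search([1.0, 0.0], ["overtime"], chunks)
print(hits)  # ['a', 'c']
```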

Finally, when using semantic search it has been demonstrated that synthetic responses—even when completely incorrect—are often more semantically similar to target blobs than the messages which generate them. So, you might consider using a technique like HyDE to increase your hit-rate.
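To make the HyDE idea concrete, here is a small sketch. `generate` and `embed` are hypothetical callables standing in for an LLM call and an embedding call, and the toy versions below exist only so the example runs. Scoring each chunk by its best match against any search vector is one possible choice; averaging the search vectors is another.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hyde_search(question, generate, embed, chunks, n_hypotheses=2, top_n=3):
    """Search with the question plus hypothetical answers to it.

    generate(question) -> str   : hypothetical LLM wrapper (assumption)
    embed(text) -> list[float]  : hypothetical embedding wrapper (assumption)
    chunks: list of (chunk_id, vector) pairs.
    """
    texts = [question] + [generate(question) for _ in range(n_hypotheses)]
    search_vectors = [embed(t) for t in texts]
    # Score every chunk by its best similarity to any of the search vectors.
    scored = [(max(cosine(v, vec) for v in search_vectors), cid)
              for cid, vec in chunks]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored[:top_n]]

# Toy stand-ins: a bag-of-words "embedding" and a canned "LLM" answer.
VOCAB = ["overtime", "pay", "rate", "holiday", "why"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def toy_generate(question):
    return "overtime pay rate is one and a half times the base rate"

chunks = [("policy", toy_embed("overtime pay rate table")),
          ("calendar", toy_embed("holiday calendar"))]
question = "why is my cheque bigger this month"
result = hyde_search(question, toy_generate, toy_embed, chunks, top_n=1)
print(result)  # ['policy']
```

In this toy setup the question shares no vocabulary with the "policy" chunk, so only the hypothetical answer pulls it to the top.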

All of these approaches involve some work and/or cost to implement, but all of them have been shown to dramatically improve the hit-rate of document searches.

If it’s really critical to you to get the absolutely best retrieval results, I’d just go all out and do them all.

  1. Re-embed all of your chunks with two or three additional embedding models.
  2. When you need to perform a retrieval, generate one or more hypothetical responses.
  3. Compute embedding vectors for the message and all of the hypothetical responses using each of the embedding models. These will be your search vectors.
  4. Perform keyword extraction from the message and responses.
  5. Filter your documents by keyword.
  6. For each of the embedding models compute the cosine similarity between each search vector and all of your embedded chunks and get the rank of each.
  7. Merge these using mean reciprocal ranking.
  8. Keep the top 50–100 retrievals.
  9. Send the remaining candidates to a re-ranking model.
  10. Keep the top n results from this final list based on whatever metric you decide on.

NOTE: This is almost certainly overkill and will run into diminishing returns fairly quickly. So, maybe try them individually and combine them as necessary to get the results you require.

Lastly, there is one other way to improve your retrievals and that is to use a fine-tuned embedding model. I have not yet personally fine-tuned an embedding model, but there are many resources online that can guide you through doing so.

There’s less evidence available for the efficacy of this, but from what I’ve seen it produces a dramatic improvement if your documents frequently use terms or acronyms that are specific to your domain. The idea being that through fine-tuning the embedding model learns the particular significance of these terms better than a general embedding model can, which helps it to better “understand” the semantic meaning in your documents.



I face the same challenge dealing with very large knowledge base datasets. Similar to you, what I sometimes end up with is a good answer, but not the most comprehensive answer. As mentioned earlier, the first thing you may want to look at is your embedding strategy.

I always use Semantic Chunking to make sure each chunk is as relevant as possible to its context: https://youtu.be/w_veb816Asg

Next, I use metadata as much as possible to identify where in the overall document each embedded chunk belongs:

I have an object property called “Questions”, which are questions that this particular chunk answers. This has proven to help raise the SNR (signal-to-noise ratio) significantly.

As for the embeddings themselves, I use text-embedding-ada-002, which produces vectors with 1536 dimensions.

I can’t speak to using multiple embedding models as I’ve not done this, but I use Weaviate as my vector store, and it has an embedding transformer text2vec-openai which has been working very well for me for several months now.

Now, all that said, sometimes with some 50 results, I still do not get the most comprehensive answers with some document sets (particularly the labor agreements and religious sermons). I have found that increasing the document retrieval limit much past 50 does not always yield the anticipated results. The model, gpt-4-turbo in this case, may have a much larger input context window, but the more text you put in it, the less efficient it is at understanding that text. See: Anthropic's best Claude 2.1 feature suffers the same fate as GPT-4 Turbo

So, I’ve come up with my own strategy, which I call “Deep Dive”. It is similar to this approach, where we use the LLMs to extract more comprehensive results: Biggest difficulty in developing LLM apps - #41 by plasmatoid

But mine is even simpler: I run the cosine similarity search and return the top 50-100 retrievals as discussed earlier. But instead of returning those to the primary LLM to evaluate, I use a cheaper model (with a smaller context window) to evaluate the results in batches of 10-15 and give me the most relevant matches. Once this LLM has evaluated all the results, I then send its results, along with the most relevant documents, to the main LLM. From here, I do one of two things: either return the concatenated results of the secondary LLM, or have the primary LLM evaluate the top relevant documents. Not sure which one works best yet.
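For anyone who wants to try the batching part, a minimal sketch of how it could look. `judge` is a hypothetical wrapper around the cheaper model that returns the indices of the relevant items in a batch; the batch size and the per-batch cap are illustrative assumptions, not the exact settings described above.

```python
def deep_dive(question, retrieved, judge, batch_size=10, keep_per_batch=3):
    """Screen retrieved chunks in small batches with a cheaper model.

    retrieved: list of chunk texts, best-first (e.g. the top 50-100 hits).
    judge(question, batch) -> list of indices into `batch` judged relevant;
        a hypothetical wrapper around the cheaper LLM (assumption).
    Returns the shortlisted chunks to hand to the primary LLM.
    """
    shortlisted = []
    for start in range(0, len(retrieved), batch_size):
        batch = retrieved[start:start + batch_size]
        picks = judge(question, batch)[:keep_per_batch]
        # Ignore any out-of-range index the model might return.
        shortlisted.extend(batch[i] for i in picks if 0 <= i < len(batch))
    return shortlisted

# Toy judge so the example runs without any API: keyword match only.
def toy_judge(question, batch):
    return [i for i, text in enumerate(batch) if "overtime" in text.lower()]

retrieved = [f"chunk {i} about overtime" if i % 4 == 0 else f"chunk {i} about parking"
             for i in range(25)]
shortlist = deep_dive("What is the overtime rate?", retrieved, toy_judge)
print(len(shortlist))  # 7
```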

Anyway, my two cents.


This way of handling large knowledge base datasets has several rather commendable aspects, but there are areas where improvements can be made for greater efficiency and accuracy. While not guaranteed, I believe that by following these steps, you could achieve better results.


  1. Semantic Chunking: This is a great strategy for context relevance. Ensuring each chunk is meaningful within its context can significantly enhance the quality of information retrieved.
  2. Metadata Utilization: Using metadata to trace the location and relevance of each chunk within a larger document is a smart approach. It adds a layer of organization that can make retrieval more precise.
  3. Embedding Choice: Your choice of text-embedding-ada-002 for generating 1536-dimensional vectors captures textual features effectively.
  4. “Deep Dive” Strategy (GPT like response): Using a secondary, cheaper model for initial evaluation and then passing the filtered results to the primary LLM is like a MoE with fact checking. It has been proven ( https://www.youtube.com/watch?v=Zlgkzjndpak ) that using multiple agents to check each other reduces hallucinations and corrects mistakes. With this, there are several pitfalls ( The Pros and Cons of Waterfall Methodology | Lucidchart ) that can be addressed by using a turn based system ( https://www.youtube.com/watch?v=yhBiVrigWNI ) and creating a local warehouse which stores turns and conversations.

Areas for Improvement:

  1. Top-n Retrieval Limitation: Your current method of selecting the top-n documents can miss relevant information if it falls beyond the top-n range. This is a significant limitation, especially for larger datasets where relevant information might be ranked lower. Solution: Implement a dynamic thresholding system that adjusts the number of documents retrieved based on the dataset’s size and the complexity of the query. Machine learning techniques can be employed to predict the optimal number of documents to retrieve for each query.
  2. Single Embedding Model: Relying solely on text-embedding-ada-002 might limit the diversity of your embeddings. Solution: Experiment with a hybrid embedding approach that combines different models as stated above. This can help capture a wider range of textual nuances and might improve the relevance of retrieved documents as long as there is a reference Agent watching the DB or warehouse results.
  3. Limiting Document Retrieval Beyond 50: While increasing the document retrieval limit hasn’t always been effective, sticking rigidly to a number like 50 might be arbitrary and not optimal for all cases. Solution: Use adaptive retrieval limits based on the query’s complexity and the dataset’s characteristics. Machine learning models can help in deciding the optimal number of documents to retrieve, and using LLM reasoning like RAG fusion and advanced rag can help to solve this.
  4. Efficiency of Primary LLM with Large Context: When the primary LLM processes a large volume of text, its efficiency decreases. Solution: Before passing large text volumes to the primary LLM, pre-process the data to condense or summarize information without losing context. Techniques like extractive summarization can be helpful here, or use MAMBA processing SSM6 for longer Seq2 to get better results in an unlimited fashion with less convolution.
  5. Uncertainty in Final Evaluation Step: The dilemma between using concatenated results of the secondary LLM or having the primary LLM re-evaluate the documents indicates a lack of clarity in the final step of your process. Solution: Systematically test both methods across different types of queries and datasets. Analyze the results to determine which method consistently yields more accurate and comprehensive answers. While CoT and ToT are effective in this manner, Microsoft’s new AoT (algorithm of thought) will save energy and shots.
  6. Generalizability Across Diverse Document Types: You mentioned challenges with specific types of documents like labor agreements and religious sermons. Solution: Develop specialized processing modules for document types that are consistently problematic. These modules could include tailored embeddings or specific pre-processing steps that address the unique characteristics of these documents. Due to filtering and belief-based restrictions on GPT3.5 16k and GPT 4 + or widely available GPT++ models that are corporately available, building your own system based on GPT or MAMBA setups can give you greater responses than those which are filtered.

This is how to do it! You have to actually look at a lot of requests and responses and figure out what’s going on, and then adopt an appropriate fix for each kind of failure.

FWIW, the way I do retrieval, is having one cosine similarity threshold (don’t include any document under 0.7, say) and then a ranked filling of the context. (I also have a fixed upper limit on the count, but that’s higher than what I can typically fit in the context.) I decide on a number of tokens I want to “spend,” and pick top-ranked retrieved documents in best-to-lowest order until the context is full. A smaller document with lower score MAY be included even if a higher-scored document isn’t included, if that document is too large to include. I use a simple greedy linear accumulation algorithm.
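The threshold-plus-greedy-fill approach described above could be sketched roughly like this. The tuple layout, the function name, and the numbers are assumptions for illustration; token counts would come from whatever tokenizer you use.

```python
def fill_context(hits, token_budget, min_score=0.7, max_docs=50):
    """Greedily fill a token budget with the best-scoring retrieved documents.

    hits: list of (score, token_count, doc_id) tuples in any order.
    Documents under min_score are dropped; the rest are visited best-first and
    added whenever they still fit, so a smaller, lower-scored document can be
    included even when a larger, higher-scored one had to be skipped.
    """
    eligible = sorted((h for h in hits if h[0] >= min_score),
                      key=lambda h: h[0], reverse=True)[:max_docs]
    chosen, used = [], 0
    for score, tokens, doc_id in eligible:
        if used + tokens <= token_budget:
            chosen.append(doc_id)
            used += tokens
    return chosen, used

hits = [(0.92, 400, "a"), (0.88, 900, "b"), (0.81, 300, "c"), (0.65, 100, "d")]
chosen, used = fill_context(hits, token_budget=1000)
print(chosen, used)  # ['a', 'c'] 700
```

Here "b" is skipped because it does not fit after "a", "c" is still added, and "d" never qualifies because it is under the similarity threshold.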

But, again: If you find that the “right” document is outside your retrieval window, and a lot of “wrong” documents are inside the window, then why is that? What if you test sub-segments of the documents and sub-segments of the query against each other? Can you compress the query (or the document) to filter out fluff that might dilute the embedding score?

Sure! I don’t know if it’s part of any particular packages, but for one product we noticed that it’s better to build a sort of tree.

You have your embedding indices, but they’re categorized by topics. Each topic has a topic description that is itself embedded, so first you find the relevant topic(s), and then you look inside those topics to find the actual documents you need.

We tried using center of mass but only had mixed results from the start, so we generate topic summaries instead for this particular case.
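A very small sketch of that two-level lookup, assuming an in-memory dict where each topic has a hypothetical `summary_vector` (the embedded topic description) and a list of `(chunk_id, vector)` pairs. The vectors are toy values; in practice each topic would be its own index or collection.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def topic_search(query_vec, topics, top_topics=1, top_n=3):
    """First pick the closest topic(s) by summary vector, then search only inside them."""
    ranked_topics = sorted(
        topics,
        key=lambda name: cosine(query_vec, topics[name]["summary_vector"]),
        reverse=True,
    )[:top_topics]
    candidates = [(cosine(query_vec, vec), cid)
                  for name in ranked_topics
                  for cid, vec in topics[name]["chunks"]]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked_topics, [cid for _, cid in candidates[:top_n]]

# Toy two-topic corpus.
topics = {
    "payroll": {"summary_vector": [1.0, 0.0],
                "chunks": [("pay-1", [0.9, 0.1]), ("pay-2", [0.5, 0.5])]},
    "facilities": {"summary_vector": [0.0, 1.0],
                   "chunks": [("fac-1", [0.1, 0.9])]},
}
picked_topics, picked_chunks = topic_search([0.8, 0.2], topics)
print(picked_topics, picked_chunks)  # ['payroll'] ['pay-1', 'pay-2']
```

Raising `top_topics` trades some of the narrowing for robustness when a query straddles two topics.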


Do you start with a pre-defined list of topic summaries, or do you use the model to create a list and assign individual embeddings according to that list?

Or, the opposite. You look at the document (or document chunks) and determine the relevant topics (or keywords or categories) from there. I did something similar with categorization of prompt questions and responses: Using the LLM to Categorize Responses

And, it’s just now dawning on me that I could do the same thing with my embeddings: create and assign global keywords/categories that could be used to filter down searches to the most appropriate chunks.

The topics emerge, so to speak; if new docs don’t fit into any prior topics, a new one can be created.

Yeah, the basic point is to use a semantic search to filter down to a reasonable set of results (re-ranking if sensible, e.g. prioritising “official staff content”), then only present that content to the LLM.

I even use cosine distance and a threshold to cut out results that are too dissimilar.

This will super charge its capability whilst reducing cost massively! It may also make it much much faster.

This is what I am thinking of doing as well:

During extraction - use an LLM to extract topics for each chunk and store them as metadata.
During generation - have the LLM determine the topic for the question and use that as a filter.

The engine is lacking as it has no real concept of “concepts” and topics/subtopic relevance, and this will never improve until the architecture is changed where you can force or teach importance on a level other than the number of times something appears.

Here is the answer from our custom GPT on the topic →

Regenerative Development Corporation

The ranking and selection of the topics related to regenerative development were not based on a specific hierarchical order or quantitative analysis. Instead, they were chosen and organized based on the principles and themes commonly emphasized in the field of regenerative development. These principles and themes are frequently discussed in literature, practice, and educational resources related to regenerative development and sustainability.

Basically – nearly useless.

I am also trying the cross-encoder reranking approach using an LLM. Has anyone else tried it out?

Could you please explain what you meant by “Re-embed all of your chunks” in the first point? Do I need to re-embed the embeddings, or embed the chunks using multiple embedding models separately and store them in different collections?

If a document is represented by a single vector, that could be a problem. The problem gets worse for large documents. Since a single vector can’t adequately represent specific content within a document, the vector similarity searches go off the mark. While you observe the number of documents creating the problem, my suggestion is to see how big each document is, which could be the root cause.

I also tried creating synthetic questions from smaller chunks. That also does not work well. It looks to me like the embedding models from both Cohere and OpenAI are not very good at granular context matching. We need to find out how we can reduce the surface area of the search, which is an art in itself.
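On the document-size point, one simple way to avoid one-vector-per-document is to split each document into overlapping windows and embed each window separately. The sketch below is a plain fixed-size word splitter with assumed sizes (200 words, 40 overlap); it is not semantic chunking, just the most basic way to shrink what each vector has to represent.

```python
def split_into_chunks(text, max_words=200, overlap=40):
    """Split text into overlapping word windows so each vector covers less content.

    max_words and overlap are illustrative defaults (assumptions), not tuned values.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks

document = " ".join(f"w{i}" for i in range(500))  # stand-in for a long document
pieces = split_into_chunks(document)
print(len(pieces))  # 3
```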