I face the same challenge dealing with very large knowledge base datasets. Like you, I sometimes end up with a good answer, but not the most comprehensive one. As mentioned earlier, the first thing you may want to look at is your embedding strategy.
I always use Semantic Chunking to make sure each chunk is as relevant as possible to its context: https://youtu.be/w_veb816Asg
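Conceptually, that boils down to something like this (a rough sketch in Python; `embed()` is a stand-in for whatever embedding call you use, and the similarity threshold is something you tune per corpus):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embed, threshold=0.75):
    """Split where the similarity between adjacent sentences drops, i.e. at topic shifts."""
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:  # topic shift -> close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```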
Next, I use metadata as much as possible to identify where in the overall document each embedded chunk belongs:
I have an object property called “Questions”, which lists the questions that this particular chunk answers. This has proven to raise the SNR (signal-to-noise ratio) significantly.
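For illustration, the shape of that schema in Weaviate looks roughly like this (v3 Python client syntax; the class and property names are examples, not my exact schema):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

client.schema.create_class({
    "class": "DocChunk",
    "vectorizer": "text2vec-openai",        # Weaviate calls OpenAI for the embeddings
    "properties": [
        {"name": "content",   "dataType": ["text"]},    # the chunk itself
        {"name": "source",    "dataType": ["text"]},    # which document it came from
        {"name": "section",   "dataType": ["text"]},    # where in the document it sits
        {"name": "questions", "dataType": ["text[]"]},  # questions this chunk answers
    ],
})
```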
As for the embeddings themselves, I use text-embedding-ada-002, which produces vectors with 1536 dimensions.
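For reference, a direct ada-002 call with the current openai Python client looks like this (illustrative only; as noted next, Weaviate’s vectorizer module makes this call for me):

```python
from openai import OpenAI

oai = OpenAI()  # separate from the Weaviate client above
resp = oai.embeddings.create(model="text-embedding-ada-002", input="chunk text here")
vector = resp.data[0].embedding
print(len(vector))  # 1536
```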
I can’t speak to using multiple embedding models as I’ve not done this, but I use Weaviate as my vector store, and it has a vectorizer module, text2vec-openai, which has been working very well for me for several months now.
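Continuing the schema sketch above, a retrieval then looks roughly like this; with text2vec-openai configured, the query text is embedded automatically, so there is no manual ada-002 call:

```python
results = (
    client.query
    .get("DocChunk", ["content", "source", "section", "questions"])
    .with_near_text({"concepts": ["overtime rules for weekend shifts"]})  # example query
    .with_limit(50)
    .do()
)
chunks = results["data"]["Get"]["DocChunk"]
```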
Now, all that said, even with some 50 results, I still do not always get the most comprehensive answers with certain document sets (particularly the labor agreements and religious sermons). I have found that increasing the document retrieval limit much past 50 does not always yield the anticipated results. The model, gpt-4-turbo in this case, may have a much larger input context window, but the more text you put into it, the less effective it is at actually using that text. See: Anthropic's best Claude 2.1 feature suffers the same fate as GPT-4 Turbo
So, I’ve come up with my own strategy, which I call “Deep Dive”. It is similar to this approach, where LLMs are used to extract more comprehensive results: Biggest difficulty in developing LLM apps - #41 by plasmatoid
But mine is even simpler: I run the cosine similarity search and return the top 50-100 retrievals as discussed earlier. Instead of passing those straight to the primary LLM to evaluate, I use a cheaper model (with a smaller context window) to evaluate the results in batches of 10-15 and give me the most relevant matches. Once this LLM has evaluated all the results, I send its output, along with the most relevant documents, to the main LLM. From here, I do one of two things: either return the concatenated results of the secondary LLM, or have the primary LLM evaluate the top relevant documents. Not sure which one works best yet.
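To make that concrete, here is a rough sketch of the batched filtering pass (the model names, batch size, prompt wording, and parsing are illustrative, not my exact implementation):

```python
from openai import OpenAI

client = OpenAI()

def deep_dive(question, retrieved_chunks, batch_size=10):
    """Filter 50-100 retrieved chunks with a cheap model, then answer with the primary one."""
    shortlisted = []
    for i in range(0, len(retrieved_chunks), batch_size):
        batch = retrieved_chunks[i:i + batch_size]
        numbered = "\n\n".join(f"[{j}] {c}" for j, c in enumerate(batch))
        # Cheap, smaller-context model picks the passages that actually answer the question.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {question}\n\nPassages:\n{numbered}\n\n"
                    "Reply with the numbers of the passages relevant to the question, comma-separated."
                ),
            }],
        )
        picks = resp.choices[0].message.content
        shortlisted += [
            batch[int(n)]
            for n in picks.replace(" ", "").split(",")
            if n.isdigit() and int(n) < len(batch)
        ]

    # Final pass: the primary model answers using only the shortlisted chunks.
    context = "\n\n".join(shortlisted)
    final = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": f"Using only this context:\n{context}\n\nAnswer: {question}"}],
    )
    return final.choices[0].message.content
```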
Anyway, my two cents.