RAG is failing when the number of documents increases

I feel like for longer (agentic) process flows, where it's not just about answering a single specific query, it could be an interesting approach to loop in a human and avoid surprises in the final response/output.

Ideally your system should be able to identify when it needs human assistance.

I'm thinking it should be possible to identify whether your RAG results are focused enough or whether you need further refinement of your query :thinking:
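One rough way to approximate that check (my own heuristic, not something proposed in this thread) is to look at the similarity scores of the top retrieved chunks: a weak best match, or a flat score profile where nothing stands out, suggests the query needs refinement or a human in the loop. The thresholds and function name below are purely illustrative.

```python
# Illustrative heuristic for deciding whether retrieval results are "focused
# enough" or whether the query needs refinement / human review.
# Thresholds are assumptions and would need tuning per corpus.

def needs_refinement(scores: list[float],
                     min_top_score: float = 0.45,
                     min_gap: float = 0.05) -> bool:
    """scores: cosine similarities of the top-k retrieved chunks, sorted descending."""
    if not scores:
        return True                          # nothing retrieved at all
    top = scores[0]
    spread = top - (sum(scores) / len(scores))
    # A weak best match, or a flat distribution where nothing stands out,
    # both suggest the query is too broad or off-target.
    return top < min_top_score or spread < min_gap

# Example: a flat, mediocre score profile triggers refinement / human review.
print(needs_refinement([0.41, 0.40, 0.39, 0.39]))  # True
print(needs_refinement([0.72, 0.55, 0.41, 0.33]))  # False
```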


This is the approach I am taking for now


That's what I call "vector dilution": when your embedded chunks contain more than one precise idea (what I call an atomic idea, because it doesn't make sense to break it down into smaller pieces), their position in embedding space is kind of "blurry" or "diluted" across the ideas they contain.

So when you run the retrieval query, which in most cases contains only one idea and is more precise, it is hard to get a high similarity match with the "diluted" vectors of the chunks you're looking for.

When your corpus is small, this is not easy to diagnose. But as the amount of data grows, you end up with tons of chunks spread all over the place in your rankings, and if your cap on how many chunks to include in the prompt is small (which it should be), you miss the important info.

Ideally, your chunks should contain only one idea at a time and a means to trace them back to their source, so you can pull more context into the prompt if needed (because atomic-idea chunks often are not enough to answer complex questions). So your retrieval becomes multi-step: find chunks that match the query, pull more context from the chunks' source references, build the prompt, and only then answer.
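To make that multi-step flow concrete, here is a minimal sketch, assuming an in-memory store where every atomic chunk keeps a reference back to its source section. The class and function names are mine, not from the original post, and it assumes the OpenAI Python SDK and numpy.

```python
# Minimal sketch of the multi-step retrieval described above:
# 1) match atomic chunks, 2) follow their source references to pull wider
# context, 3) build the prompt. All names here are illustrative.
from dataclasses import dataclass

import numpy as np
from openai import OpenAI

client = OpenAI()

@dataclass
class Chunk:
    text: str            # one "atomic idea"
    source_id: str       # reference back to the parent document / section
    embedding: np.ndarray

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_prompt(query: str, chunks: list[Chunk],
                 sources: dict[str, str], top_k: int = 10) -> str:
    # Step 1: find the atomic chunks that match the query.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, c.embedding), reverse=True)[:top_k]
    # Step 2: pull the wider context of each matched chunk, deduplicated by source.
    seen: list[str] = []
    for c in ranked:
        if c.source_id not in seen:
            seen.append(c.source_id)
    context = "\n\n".join(sources[sid] for sid in seen)
    # Step 3: only now build the prompt and answer.
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
```

The key point is that the prompt is built from the wider source context, not from the tiny atomic chunks themselves.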

But it's easier said than done (unless you have a robust chunking tool; see some of my previous messages on the subject).

But you can still make it work even in this situation.

Here are some ideas:

Don't cut the number of results based on similarity to the query, but rather on their "usefulness": increase the number of chunks you pull out. To make them fit into your prompt, instead of pushing all results there, select those that either contain the answer or additional information that helps to improve the answer. This way you'll trim your results in an extra step, but you'll improve the quality of the answer.
Once the results are pulled out, run them in parallel against a model trained to evaluate their "usefulness", then select only the ones that pass the test (again, I don't have the link handy, but you can search my messages on the forum for more details). A sketch of this filtering step follows below.
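A minimal sketch of that filtering step, with one big assumption: a prompted gpt-4o-mini stands in for the purpose-trained "usefulness" evaluator mentioned above, and the prompt wording and function names are mine.

```python
# Sketch of the parallel "usefulness" filter: pull more chunks than you intend
# to keep, ask a small model whether each one helps answer the query, and keep
# only those that pass. A prompted gpt-4o-mini is used here as a stand-in for
# a model trained for this task.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()

def is_useful(query: str, chunk: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=5,
        messages=[
            {"role": "system",
             "content": "Answer YES if the passage contains the answer to the "
                        "question or information that improves the answer. "
                        "Otherwise answer NO. Reply with one word."},
            {"role": "user", "content": f"Question: {query}\n\nPassage: {chunk}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")

def filter_chunks(query: str, chunks: list[str], max_workers: int = 10) -> list[str]:
    # Evaluate all candidate chunks in parallel, keep only the useful ones.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda c: is_useful(query, c), chunks))
    return [c for c, keep in zip(chunks, verdicts) if keep]
```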

But with little info about your app and your data structure schema, it's hard to come up with something truly useful.


Please don't take offense, but to me it sounds like you probably missed some important details when designing your data structures, which causes your vectors to not reflect the info you're trying to find well enough. What's the goal of the retrieval, and what do your chunks look like?

Just a follow-up. In addition to implementing my own semantic chunking strategy (Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop) as well as Small-to-Big chunk retrieval for better chunk context (Advanced RAG 01: Small-to-Big Retrieval | by Sophia Yang, Ph.D. | Towards Data Science), I also deployed your "Deep Dive" strategy. Essentially, I take the top 50 (or even 100) cosine similarity search results and rate each chunk based upon its relationship to the actual question asked. I do one chunk at a time, which ensures the best model response. I then return the highest-rated chunks together as context to the model for a complete answer.

I'm using OpenAI's new text-embedding-3-large embedding model.

Not only is this process faster than I thought it was going to be (since each API call returns only a single rating number, in my case 0-10), but it is also far less expensive than I imagined (especially with the new gpt-4o-mini and gemini-1.5-flash models).
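For reference, here is roughly what that per-chunk rating step can look like. The prompt wording, the threshold, and the choice of gpt-4o-mini as the rater are my assumptions, not the exact setup described above.

```python
# Minimal sketch of the "Deep Dive" step described above: rate each of the
# top-N retrieved chunks against the actual question, one chunk per call,
# then keep the highest-rated ones as context.
from openai import OpenAI

client = OpenAI()

RATER_SYSTEM = (
    "Rate how useful the passage is for answering the question on a scale "
    "from 0 (irrelevant) to 10 (contains the answer). Reply with the number only."
)

def rate_chunk(question: str, chunk: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",        # cheap rater; each call returns a single number
        temperature=0,
        max_tokens=3,
        messages=[
            {"role": "system", "content": RATER_SYSTEM},
            {"role": "user", "content": f"Question: {question}\n\nPassage: {chunk}"},
        ],
    )
    try:
        return int(resp.choices[0].message.content.strip())
    except ValueError:
        return 0                    # treat unparseable replies as irrelevant

def deep_dive(question: str, chunks: list[str], threshold: int = 7) -> list[str]:
    # One chunk per call, as described above; keep only the highest-rated ones.
    rated = [(chunk, rate_chunk(question, chunk)) for chunk in chunks]
    rated.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, rating in rated if rating >= threshold]
```

Since each rating call is independent, the loop can also be parallelized, which is one way the overall latency stays low even with 50-100 candidate chunks.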

This works amazingly well. I actually thought I had conceived the concept of "Deep Dive", but it looks like you beat me to it! Anyway, thank you so much for this contribution. It turned out to be the key to my issues with getting "comprehensive" responses.


Out of curiosity, can I ask how specifically you evaluate the relationship?

My system is very user-defined. So, I kind of play around to develop a System Message which the model will use, along with the original query, to give each content chunk a ranking.

Based on trial and error, I come up with a ranking level threshold which has the best chance of selecting the most relevant chunks.

Note that this System Message could be different depending upon the general type of documents being processed. The example(s) you provide in it should reflect this.
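Purely as an illustration of how such a System Message might be adapted to a document type, here is one possible shape for, say, legal or policy documents. The wording and the threshold value are invented, not the actual configuration discussed above.

```python
# Illustrative document-type-specific rating System Message and threshold,
# in the spirit described above. Wording and threshold are assumptions.
LEGAL_RATER_SYSTEM = """You rate excerpts from legal and policy documents.
Given the user's question and one excerpt, reply with a single integer from 0 to 10:
- 10: the excerpt directly states the rule, clause, or fact being asked about.
- 5-9: the excerpt provides definitions, exceptions, or context that would improve the answer.
- 0-4: the excerpt is off-topic or merely mentions similar keywords.
Reply with the number only."""

RATING_THRESHOLD = 7   # found by trial and error per corpus, as noted above
```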


This is what I was thinking of as an alternative, with regard to using mamba2 or better. I feel future systems will not just be MoE, but will also have interoperability between the transformer and SSM architectures.
There needs to be more programmers like you figuring out these problems so that we can get past AGI soon :wink:


Why zero to ten? 10 has two digits and kind of falls outside the 0-9 row.


No logical reason. Habit? As usual, you bring up an excellent point. Thanks!

Way easier to work with a fixed number of digits. You never know where the project will end up.

Just in case you think no one is paying attention when you provide these very detailed posts:

  • Reasoning Behind "Comprehension Level", "Filter Results" and "Deep Dive" Strategies.
    • RAG is failing when the number of documents increase - #24 by sergeliatko
    • Ideally, your chunks should contain only one idea at a time and a means to trace them back to their source, so you can pull more context into the prompt if needed (because atomic-idea chunks often are not enough to answer complex questions). So your retrieval becomes multi-step: find chunks that match the query, pull more context from the chunks' source references, build the prompt, and only then answer.
    • Don't cut the number of results based on similarity to the query, but rather on their "usefulness": increase the number of chunks you pull out. To make them fit into your prompt, instead of pushing all results there, select those that either contain the answer or additional information that helps to improve the answer. This way you'll trim your results in an extra step, but you'll improve the quality of the answer.
      Once the results are pulled out, run them in parallel against a model trained to evaluate their "usefulness", then select only the ones that pass the test.
      • Deep Dive (RAG is failing when the number of documents increase - #26 by SomebodySysop) automatically increases the number of results returned.
      • Filter Results (Using gpt-4 API to Semantically Chunk Documents - #172 by SomebodySysop), however, filters the returned chunks based upon their relevance to the question.
      • The result of these two combined is answers that are more comprehensive and aren't diluted by unnecessary noise. You may expand your search to 100 or 200 chunks, but each chunk is examined to determine its relevance to the actual question. So, in reality, you end up returning only a dozen or fewer highly relevant chunks to the model, dramatically decreasing your prompt size while increasing the quality of your responses.

This combination of tools has been working fabulously. Thank you again for your contributions!


Thanks for the feedback. Pleasure to read that. Yes, sometimes I'm a bit too wordy, but the subject is a tricky one. It's just crazy that all of us went from "how do I complete the phrase?" to data-driven RAG workflow optimization in a couple of years...
