The issue with “chunk-based RAG” is that it’s really hard to get your chunks right. There are lots of cases where the user’s query is asking for a very specific needle, and if you do your retrieval properly, one of the chunks you retrieve will have the needle in it, letting you successfully answer the user’s question. I’ll even go so far as to say that chunk-based RAG will likely work 80% of the time, but it’s the other 20% that keeps me up at night. Let me give some examples of where chunk-based RAG is prone to failure.
Order Sensitivity
Let’s say your query is “what are the steps needed to do job xyz.” You should have good similarity between the query and the answer text, so this is a query that RAG should do well on, but it’s actually your chunking strategy that can cause a bad answer. If the step sequence is really long it’s likely going to span multiple chunks, so there are at least two things that could go wrong.
The first potential issue is that you retrieve all of the steps but show them to the model in the wrong order. If the steps are numbered there’s a chance the model will correct the order in its output, but it also might leave them out of order or, worse, re-number them. Somewhere on here I have a post that goes in depth on document ordering issues; the short version is that you should always re-order your chunks into document order as a post-processing step.
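Here’s a minimal sketch of that post-processing step. It assumes each retrieved chunk carries “doc_id” and “position” metadata (the chunk’s index within its source document); those field names are hypothetical and will vary with your vector store.

```python
def reorder_chunks(retrieved_chunks: list[dict]) -> list[dict]:
    """Sort retrieved chunks back into document order before prompting."""
    # Group by source document, then by the chunk's position within it.
    return sorted(retrieved_chunks, key=lambda c: (c["doc_id"], c["position"]))


# Example: chunks come back from the retriever in similarity order...
chunks = [
    {"doc_id": "manual", "position": 7, "text": "Step 3: ..."},
    {"doc_id": "manual", "position": 2, "text": "Step 1: ..."},
    {"doc_id": "manual", "position": 5, "text": "Step 2: ..."},
]

# ...and get shown to the model in document order instead.
for chunk in reorder_chunks(chunks):
    print(chunk["text"])
```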
The second issue is that your steps get cut off because you simply fail to retrieve the chunks containing all of them. The longer the sequence, the more likely it is that some of the needed chunks will have low similarity to the user’s question and you risk dropping them. One thing I hope we can all agree on is that if you don’t show the model the information needed to answer the question, it can’t include that information in its answer.
Multi Needle Reasoning
For some questions the model needs to see multiple needles to properly answer the user’s question, and these needles are rarely in the same document, let alone the same chunk. One of the test queries we use at my company requires 23 documents to answer, but that’s an extreme case.
This is probably easiest to illustrate with a simple multi-hop question, though multi-hop questions open up a whole new can of worms. The query is “who had the highest sales this month for the North America region.” If you’re lucky, all of the information needed to answer this question is in the same document, but if you’re unlucky you’ll need to pull in two documents: one with the monthly sales figures and another with a list of sales members in North America.
If you can get all of the needed information into the context window, a GPT-4-class model is more than capable of answering this question. The trick is getting it all into the context window. You’re generally going to need to fit all of the sales figures into the window, and these are going to have poor similarity with the query. You’re also going to need to fit all of the sales members into the context window, and they will have poor similarity with the query too. So you’re battling two things… You have a lot of dispersed information you need to show the model, and all of it has poor similarity with the query.
If I had to guess, most chunk-based RAG systems will fail to reliably answer this query for corpora of any significant size, and it’s a very reasonable question to want to ask of your RAG system.
Summarization Queries
This one’s an easy one to see how chunk-based RAG systems are going to fail. If your query is “summarize xyz” and xyz is too big to fit in the context window, you’re not going to be able to properly summarize the document. You have to use a completely different strategy for summarization, like recursive summarization, but the challenge with that is knowing when you need to switch to this other strategy.
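For what it’s worth, here’s a rough sketch of recursive summarization under stated assumptions: the llm_summarize() helper, the word-count token estimate, and the budget are all hypothetical stand-ins for whatever model call and context limit you actually use.

```python
MAX_CHUNK_TOKENS = 6000  # assumed per-call budget


def count_tokens(text: str) -> int:
    # Crude word-count stand-in; swap in a real tokenizer for your model.
    return len(text.split())


def llm_summarize(text: str) -> str:
    # Placeholder for the actual LLM call that produces a summary.
    return text[:200]


def recursive_summarize(text: str) -> str:
    """Summarize text of any length by summarizing pieces, then the summaries."""
    if count_tokens(text) <= MAX_CHUNK_TOKENS:
        return llm_summarize(text)
    # Split into pieces that fit the budget, summarize each piece,
    # then recursively summarize the concatenated partial summaries.
    words = text.split()
    pieces = [" ".join(words[i:i + MAX_CHUNK_TOKENS])
              for i in range(0, len(words), MAX_CHUNK_TOKENS)]
    partial_summaries = [llm_summarize(p) for p in pieces]
    return recursive_summarize("\n\n".join(partial_summaries))
```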
My Strategy: Whole File Reasoning
For all of the reasons above I really don’t like slicing up individual documents, so I don’t (or I mostly don’t). To me, documents represent human-authored boundaries of information: chunk boundaries that a human has already drawn, so I do everything I can to respect and preserve them. I’ll use BM25 search to figure out which documents most likely contain the information needed to answer the question, but then I reason over the “whole file” regardless of its size. I have techniques I view as secret sauce for dealing with really large files, but the point is I do everything I can to avoid creating artificial chunk boundaries.
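To make the shape of this concrete, here’s a minimal sketch of whole-file retrieval: rank documents with BM25, then hand the full text of the top-ranked files to the model instead of chunks. It uses the rank_bm25 package; the corpus directory and the way the context gets used are assumptions, not my actual pipeline.

```python
from pathlib import Path

from rank_bm25 import BM25Okapi

# Load whole documents (assumed to live as .txt files in ./corpus).
docs = {p.name: p.read_text() for p in Path("corpus").glob("*.txt")}
names = list(docs)

# Index documents, not chunks, with BM25.
bm25 = BM25Okapi([docs[n].lower().split() for n in names])


def retrieve_whole_files(query: str, top_k: int = 3) -> list[str]:
    """Return the complete text of the top_k highest-scoring documents."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(names, scores), key=lambda x: x[1], reverse=True)
    return [docs[name] for name, _ in ranked[:top_k]]


# `context` now holds entire documents (no artificial chunk boundaries)
# ready to be placed in the prompt.
context = "\n\n---\n\n".join(retrieve_whole_files("steps needed to do job xyz"))
```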
In my testing this “chunk-less” approach to RAG addresses the issues I called out above. If anything, I have the opposite problem with the summarization case: if you ask my system to “summarize everything,” it will go off and try to summarize thousands of documents. I have to pull back on the reins in some cases.
There’s no doubt that the approach I take consumes a lot of tokens, but the cost of running my system is already 1/10th of what it was when I started building it. My approach benefits from context windows getting longer and token costs going down, both of which keep happening.
Note: I say “whole file” and not “whole document” because sometimes a document is broken into multiple files by a human author. The Medicare Claims Processing Manual is one of the larger documents in our test corpus at about 3 million tokens, but it’s broken into 39 separate PDFs, so we use BM25 search to first identify the chapters we need to reason over and then show the model every token from the chapters we selected.