RAG is not really a solution


Agree and disagree. In my experience, there are basically two types of queries: Needle in Haystack and Haystack in Haystack. The first is where we try to find some fact hidden in a large volume of text, and the latter is where we try to deduce some sweeping idea which is covered by a large part of the text in question. Think, “What does this document say about the meaning of life?”

RAG sucks at the latter, but is excellent at the former. No matter how large context windows get or how low the price of tokens drops, it will still be a waste of both to submit Proust’s Remembrance of Things Past just to find the volume or page where he says “We are all of us obliged, if we are to make reality endurable, to nurse a few little follies in ourselves.”

Needle in haystack. In this case, RAG does and will always excel.

As for understanding what that line actually means, giving the model all of the works of Proust ever published will do little good as it has no concept of the real world, or real life.

3 Likes

I think it’s a good way to handle a cache without needing to implement a Damerau-Levenshtein function.

Map a hundred question phrasings to an answer, and when a new question differs only slightly, it will still find the answer. Nothing more and nothing less.
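A minimal sketch of that idea: an embedding-based answer cache where near-duplicate phrasings of a question hit the same cached answer. The `embed` function, the threshold, and the vector format here are all illustrative assumptions, not any particular service’s API:

```python
import math

def cosine(a, b):
    # plain cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Map many phrasings of a question to one cached answer."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed        # embedding function (assumed supplied)
        self.threshold = threshold
        self.entries = []         # list of (vector, answer)

    def add(self, question, answer):
        self.entries.append((self.embed(question), answer))

    def lookup(self, question):
        # return the cached answer for the closest stored question,
        # but only if it clears the similarity threshold
        q = self.embed(question)
        best, best_score = None, 0.0
        for vec, answer in self.entries:
            s = cosine(q, vec)
            if s > best_score:
                best, best_score = answer, s
        return best if best_score >= self.threshold else None
```

With a real embedding model plugged in as `embed`, a slightly reworded question lands close to the stored one and returns the cached answer without any edit-distance code.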

@SomebodySysop I’m wondering why there are so many services popping up like Cassidy, Afforai, Langdock or Nuclia, all claiming to offer comprehensive RAG solutions. But these promises seem to conflict with expert opinions mentioned here.

Simple: It’s a gold rush (AI is), and everybody is trying to sell shovels (RAG).

Everybody’s documents are going to be different. Good luck finding one RAG-size that fits all. But that, apparently, is what they are all trying to do.

2 Likes

The issue with “Chunk based RAG” is that it’s really hard to get your chunks right. There are lots of cases where the user’s query is asking for a very specific needle, and if you do your retrieval properly, one of the chunks you retrieve will have the needle in it, letting you successfully answer the user’s question. I’ll even go so far as to say that chunk based RAG will likely work 80% of the time, but it’s the other 20% that keeps me up at night. Let me give some examples of where chunk based RAG is prone to failure.

Order Sensitivity
Let’s say your query is “what are the steps needed to do job xyz”. You should have good similarity between the query and the answer text, so this is a query that RAG should do well on, but it’s actually your chunking strategy that can cause a bad answer. If this is a really long step sequence, it’s likely going to span multiple chunks, so there are at least two things that could go wrong.

The first potential issue is that you retrieve all of the steps but show them to the model in the wrong order. If the steps are numbered, there’s a chance the model will correct the order in its output, but it also might leave them out of order or, worse, re-number them. Somewhere on here I have a post that goes in depth into document ordering issues; you should always re-order your chunks to be in document order as a post-processing step.
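That post-processing step can be as simple as sorting the retrieved chunks by their original position. The `doc_id` and `position` keys below are hypothetical metadata that a chunker would need to record at indexing time:

```python
def reorder_chunks(retrieved):
    """Re-sort retrieved chunks into document order as a post-processing
    step, discarding the similarity ranking.

    `retrieved` is a list of dicts with (assumed) keys:
      doc_id   -- which document the chunk came from
      position -- the chunk's offset within that document
      text     -- the chunk contents
    """
    return sorted(retrieved, key=lambda c: (c["doc_id"], c["position"]))
```

The similarity scores have already done their job selecting the chunks; once selection is over, document order is what the model should see.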

The second issue is that your steps get cut off because you simply fail to retrieve the chunks with all the steps. The longer the sequence, the more likely it is that some of the needed chunks will have low similarity to the user’s question, and you risk dropping them. One thing I hope we can all agree on is that if you don’t show the model the information needed to answer the question, it can’t include that information in its answer.

Multi Needle Reasoning
For some questions the model needs to see multiple needles to properly answer the user’s question, and these needles are rarely in the same document, let alone the same chunk. One of the test queries we use at my company requires 23 documents to answer the question, but that’s an extreme case.

This is probably easiest illustrated with a simple multi-hop question, though multi-hop questions open up a whole new can of worms. The query is “who had the highest sales this month for the North America region”. If you’re lucky, all of the information needed to answer this question is in the same document, but if you’re unlucky, you’ll need to pull in two documents: one with the monthly sales figures and another with a list of sales members in North America.

If you can get all of the needed information into the context window, a GPT-4 class model is more than capable of answering this question. The trick is getting it all into the context window. You’re generally going to need to fit all of the sales figures into the window, and these are going to have poor similarity with the query. The same goes for the list of sales members. So you’re battling two things: you have a lot of dispersed information you need to show the model, and it all has poor similarity with the query.

If I had to guess… most chunk based RAG systems will fail to reliably answer this query for corpora of any significant size, and it’s a very reasonable question to want to ask of your RAG system.

Summarization Queries
This one’s an easy case to see how chunk based RAG systems are going to fail. If your query is “summarize xyz” and xyz is too big to fit in the context window, you’re not going to be able to properly summarize the document. You have to use a completely different strategy for summarization, like recursive summarization, but the challenge is knowing that you need to switch to this other strategy.

My Strategy: Whole File Reasoning
For all of the reasons above, I really don’t like slicing up individual documents, so I don’t (or mostly don’t). To me, documents represent human-authored boundaries of information. They’re chunk boundaries a human created, so I do everything I can to respect and preserve them. I’ll use BM25 search to figure out which documents most likely contain the information needed to answer the question, but then I reason over the “whole file” regardless of its size. I have techniques that I view as secret sauce for dealing with really large files, but the point is I do everything I can to avoid creating artificial chunk boundaries.
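The document-selection step could be sketched like this. The BM25 implementation below is a toy, self-contained version (naive whitespace tokenization, standard Okapi scoring), not the author’s actual stack, and the “secret sauce” for very large files is of course not shown:

```python
import math
from collections import Counter

def bm25_rank(docs, query, k1=1.5, b=0.75):
    """Score whole documents against a query with BM25 and return
    (doc_index, score) pairs, best first. The winners are then shown
    to the model in full rather than being sliced into chunks."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # document frequency per term
    df = Counter()
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append((i, score))
    return sorted(scores, key=lambda s: -s[1])
```

The key design point is that ranking happens at the file level: once a file wins, every token in it goes to the model, so no human-authored boundary gets cut.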

In my testing, this “chunk-less” approach to RAG addresses the issues I called out above. If anything, I have the opposite issue with the summarization case: if you ask my system to “summarize everything”, it will go off and try to summarize thousands of documents. I have to pull back on the reins in some cases.

There’s no doubt that the approach I take consumes a lot of tokens but the cost of running my system is already 1/10th the cost of when I started building it. My approach benefits from context window lengths going up and token costs going down. Both of which are true.

Note: I say “whole file” and not “whole document” because sometimes a document is broken into multiple files by a human author. This Medicare Claims Processing Manual document is one of the larger documents in our test corpora, at about 3 million tokens, but it’s broken into 39 separate PDFs, so we’ll use BM25 search to first identify the chapters we need to reason over and then show the model every token from the chapters we selected.

3 Likes

I agree with this whole file (or full context boundary) approach as one atomic context for the LLM. I do it using parent retrieval: while matching, I use smaller chunks, but I then recreate the entire context before sending it to the LLM.
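A rough sketch of parent retrieval under these assumptions: each small chunk records its `parent` (the whole file or section it came from), matching happens on the chunk, but the parent is what goes to the model. The `query_match` scoring function is assumed supplied:

```python
def parent_retrieval(query_match, chunks):
    """Match on small chunks, then hand the LLM the whole parent context.

    `chunks` is a list of dicts with (assumed) keys:
      text   -- the small chunk used for matching
      parent -- the full file/section the chunk came from
    `query_match` scores a chunk against the query (embedding
    similarity, BM25, whatever the system uses)."""
    best = max(chunks, key=query_match)
    return best["parent"]  # send the entire parent, not the matched chunk
```

Small chunks keep the matching precise; the parent lookup restores the atomic context before generation.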

3 Likes

That’s essentially what you’re doing with more traditional search algorithms like BM25. I’m definitely not trying to say that search isn’t an important component in all of this. I was on the desktop search team in Windows so I’m a search guy.

You obviously can’t present every single document to the model to reason over so search is the best way to reduce the amount of information you have to show the model. I just try to avoid guessing about which sub-parts of a document are worth showing the model.

I agree with the notion that in many cases you cannot or should not eliminate content from documents as critical information may get removed.

That said, I do believe that for certain types of tasks (e.g. trend analysis across multiple documents), there is a lot of value in pre-structuring and chunking the contents of a document in order to feed the model the content in a way that it can optimally reason over it and derive meaningful conclusions.

1 Like

Very well thought out and explained. I agree wholeheartedly.

However, if I may:

In general, you’re absolutely correct. An approach I have used with my documents is hierarchical/semantic chunking, which attempts to maintain the hierarchical structure of the original document in its embeddings. A more technical term is “layout aware” embeddings: Using gpt-4 API to Semantically Chunk Documents - #74 by SomebodySysop

But, in general, cosine similarity searches are going to return chunks according to their ranking or “distance”, and not in the order in which they were embedded.
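To make that concrete, a small sketch: given a query vector and chunk vectors, similarity search hands back indices in score order, not embedding order, so document order has to be restored separately if it matters:

```python
import math

def rank_by_similarity(query_vec, chunk_vecs):
    """Return chunk indices ordered by cosine similarity, best first.

    Note the result is in *ranking* order, not the order the chunks
    were embedded in; a separate post-processing pass is needed if
    the answer is order-sensitive."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    sims = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return [i for i, _ in sorted(sims, key=lambda s: -s[1])]
```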

For the types of documents I am working with, this is perfect. However, I can see the issue when dealing with scientific or engineering types of documents where the order of the explanation of ideas becomes far more crucial.

I have been primarily working with regulatory and philosophical/religious documents. Again, by using the hierarchical/semantic chunking strategy and rendering the “atomic idea” of each chunk, locating similar concepts across multiple documents is exactly what a well-structured RAG system is designed to do.

The biggest problem I’ve found in this approach is chunk “noise” – that is, the inclusion of chunks in the similarity results which actually do not relate to the query. I have employed filtering techniques to overcome this challenge.

As for your specific example

Personally, I see that as more of an SQL query than an LLM query. However, if you have only documents from which to pull that information, I can see the difficulties with RAG.

Absolutely agreed. But, there are strategies:

  • https://youtu.be/LNq_2s_H01Y

    • Summarization Methodologies

      • Map Reduce

        • Chunk document. Summarize each chunk, then summarize all the chunk summaries. Using this currently in embed_solr_index01.php.
      • Stuffing

        • Summarize entire document all at once, if it will fit into prompt.
      • Refine

        • Chunk document. Summarize first chunk. Summarize 2nd chunk + 1st chunk summary. Summarize 3rd chunk + 1st and 2nd chunk summary. And so on…
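The Map Reduce and Refine strategies above can be sketched generically; `summarize` here stands in for whatever LLM call you use (a hypothetical parameter, not a real API):

```python
def map_reduce_summary(chunks, summarize):
    """Map Reduce: summarize each chunk independently, then
    summarize the concatenated chunk summaries."""
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partials))

def refine_summary(chunks, summarize):
    """Refine: summarize the first chunk, then fold each subsequent
    chunk into the running summary one at a time."""
    running = summarize(chunks[0])
    for chunk in chunks[1:]:
        running = summarize(running + "\n" + chunk)
    return running
```

Map Reduce parallelizes easily; Refine is sequential but lets each step see the accumulated summary, which can help when later chunks depend on earlier ones.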

This approach seems like a really good solution to the finding-the-haystack-in-a-haystack problem. But the systems I am working on, at least for now, are not designed to answer the big, sweeping, generalized questions. I am focused on finding those needles in the haystack, and as you said, RAG is still 80%+ good at that.

2 Likes

The issue is this stuff is pretty much NEVER in a SQL database. It’s in a spreadsheet if you’re lucky, but most likely it’s just in a PowerPoint.

And there are always exceptions to the rule as in trend analysis. You should definitely find what works for your specific scenario. I’m building a company and technology stack that’s intended to be broadly applicable to a wide range of problems so what I outlined are the principles I’ve been using to guide my decision making process.

I do want to give chunk based RAG credit where credit is due… It does help with the “lost in the middle” problem that long context reasoning suffers from.

One of the core issues with “attention” is that the more spread out facts are, the more likely the model is to lose track of them. For example, if I start off with “the killer is” and then within a few tokens I say “john smith”, the model will very reliably answer “john smith” when asked “who was the killer?” As the distance between the predicate and subject grows, the less likely the model is to make the connection. This issue really shows up when a question requires the correlation of multiple facts to answer, like when you ask the model to retrieve multiple needles from a haystack.

The chunking that RAG does can actually help to minimize the lost-in-the-middle problem because it can result in moving the needed facts closer together distance-wise. The secret sauce I’ve developed does an even better job because I literally end up moving the needed facts to be right next to each other, but chunk based RAG is definitely an improvement.

1 Like

I think the problem lies in the possible double meanings e.g. in the sentence

“The crane can lift heavy loads and is standing in the field”

We got

The [a bird/a machine used for lifting] crane can lift heavy loads and is standing in the [an open area of land/a particular branch of study or interest] field

Labeling the meaning before summarizing might give a little boost.

The [a machine used for lifting] crane can lift heavy loads and is standing in the [an open area of land] field.

This can be done to words, phrases and the whole context.
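As a toy illustration of that labeling step (the sense inventory and the cue-word matching here are purely illustrative; in practice an LLM would do the disambiguation from context):

```python
# Toy sense inventory: ambiguous word -> {context cue: gloss}.
# A real system would ask a model to resolve the sense instead.
SENSES = {
    "crane": {"lift": "a machine used for lifting", "fly": "a bird"},
    "field": {"standing": "an open area of land", "study": "a branch of study"},
}

def label_senses(sentence):
    """Prefix ambiguous words with their resolved sense, chosen here
    by a naive check for a cue word elsewhere in the sentence."""
    context = sentence.lower().split()
    out = []
    for word in sentence.split():
        key = word.lower().strip(".,")
        senses = SENSES.get(key)
        if senses:
            for cue, gloss in senses.items():
                if cue in context:
                    word = f"[{gloss}] {word}"
                    break
        out.append(word)
    return " ".join(out)
```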

And since the human brain stores memory in three variants, why not do the same with information?

This may not work as well with summarizations as with atomic idea chunks, but another solution to the “double-meaning” issue is to retrieve more context: Is it possible to get "context aware" embeddings? - #9 by curt.kennedy

One strategy I have employed is “small to big” retrieval to grab adjacent chunks according to a pre-defined radius: Is it possible to get "context aware" embeddings? - #10 by SomebodySysop
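A minimal sketch of that adjacent-chunk expansion: after a small chunk matches, pull in its neighbors within a pre-defined radius before sending anything to the model:

```python
def expand_with_neighbors(hit_index, chunks, radius=1):
    """"Small to big" retrieval: given the index of a matched chunk in a
    document-ordered list, also return the adjacent chunks within
    `radius` positions to restore surrounding context."""
    lo = max(0, hit_index - radius)
    hi = min(len(chunks), hit_index + radius + 1)
    return chunks[lo:hi]
```

The radius trades precision for context: a larger radius resolves more double meanings but spends more tokens per hit.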

It depends. Haystack in Haystack (excellent naming BTW) can be transformed into a series of “needle in haystack” queries plus pre/post processing, similar to what humans do (mostly subconsciously) on the first run, and the results of this run can be saved as a “conclusion” object in RAG for future use…

2 Likes