RAG is not really a solution


After spending more than a year now with Gen AI, I feel RAG is more of a problem than a solution. It is so brittle, and there is no science to it. I wanted to check with this group whether anyone is aware of other techniques or ways to work more efficiently in this space. One thing I was expecting to see here is something similar to automated hyperparameter tuning in traditional ML. The idea: if I provide a set of retrieval techniques and a loss function, is there a product or process that will run the different retrieval techniques, calculate the loss, and automatically identify the best technique? I could not find any such product so far.
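For what it's worth, the loop being asked for is simple to sketch even if no off-the-shelf product does it: given candidate retrievers and a labeled eval set, score each and keep the winner. This is a minimal toy sketch, not a real product; the retrievers and the recall@k metric here are stand-ins you would swap for BM25, embeddings, hybrid search, etc.

```python
# Toy sketch of automated retriever selection, analogous to hyperparameter
# tuning: score each candidate retrieval function against a labeled eval
# set and keep the best. Retrievers and metric are stand-ins.

def recall_at_k(retrieved, relevant, k=3):
    """Fraction of relevant chunk ids found in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / max(len(relevant), 1)

def evaluate(retriever, eval_set, k=3):
    """Average recall@k of one retriever over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retriever(q), rel, k) for q, rel in eval_set]
    return sum(scores) / len(scores)

def select_best(retrievers, eval_set, k=3):
    """Run every candidate and return (name, score) of the winner."""
    results = {name: evaluate(fn, eval_set, k) for name, fn in retrievers.items()}
    best = max(results, key=results.get)
    return best, results[best]

# Demo with two fake retrievers over a 3-query eval set.
eval_set = [("q1", ["a"]), ("q2", ["b"]), ("q3", ["c"])]
good = lambda q: {"q1": ["a", "x"], "q2": ["b"], "q3": ["c"]}[q]
bad = lambda q: ["x", "y", "z"]
best, score = select_best({"good": good, "bad": bad}, eval_set)
print(best, score)  # good 1.0
```

The "loss function" slot here is just a retrieval metric; in practice you would likely evaluate end-to-end answer quality instead, which is much harder to score automatically.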


@stevenic has done quite a bit, but he’s busy with a startup at the moment, I believe. If you search the forum, we’ve got a lot of great hidden gems / wisdom.

Hrm… I wonder if someone would want to create a massive “here’s all the best RAG posts” for the forum…


After a year of working on my RAG system, I could not disagree more heartily.
Here are a couple of information sources I have found useful:

No idea what you’re talking about. But if that is your use requirement, I can certainly see your concern.

I just want to make the point that I have found RAG to be extremely useful in my use case: Creating keyword and semantic searchable knowledgebases consisting of thousands of documents and focused on very specific areas of interest.

Biggest problem I’ve run into so far: some query responses are not comprehensive enough. End-users can almost always get to a complete answer using chain-of-thought queries (few-shot). But the end-users I’ve been working with want complete answers on the first question (zero-shot). This may touch on your issue.

My resolution: Deep Dive. Have the model dig through all the possible responses, categorize and analyze those, then return a complete list of best responses. Since I built my RAG system myself, I also have to build this feature. So I’m thinking: whatever this technique is that you’re missing, you may have to build it yourself.
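The "Deep Dive" steps described above might look roughly like this; to be clear, this is my own guess at the shape of it, not the poster's actual implementation, and the score floor and topic labels are invented for the demo. Instead of answering from the single top chunk, gather every candidate above a relevance floor, group them, and hand the grouped set to the model for a synthesized answer.

```python
# Rough sketch of a "Deep Dive" pass (my reconstruction, not the poster's
# code): keep every candidate chunk above a score floor, bucket by topic
# best-first, then pass the buckets to the model for a complete answer.

def deep_dive(candidates, score_floor=0.75):
    """candidates: list of (chunk_text, topic, score).
    Returns buckets of sufficiently relevant chunks, grouped by topic."""
    kept = [c for c in candidates if c[2] >= score_floor]
    buckets = {}
    for text, topic, score in sorted(kept, key=lambda c: -c[2]):
        buckets.setdefault(topic, []).append(text)
    return buckets

cands = [("A1", "pricing", 0.90), ("B1", "setup", 0.80),
         ("A2", "pricing", 0.77), ("C1", "misc", 0.30)]
groups = deep_dive(cands)
print(groups)  # {'pricing': ['A1', 'A2'], 'setup': ['B1']}
```

The final step, asking the model to analyze each bucket and merge the results, is where the real cost lives; this sketch only covers the gathering and grouping.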

I have felt this way. A lot. But the truth is, so much of it is trial and error and fine-tuning the techniques that work best for your use case. And perhaps, in the end, that is the science of it all.


I find RAG extremely useful for giving the model more context. My use case is using an LLM as a classifier on text data, one that doesn’t just use ML techniques but also “knows” something about what these texts are about. E.g., I have a dataset of texts with associated numbers (encoded expense accounts, for example). So, I receive a new text and I need to find the number for it. Classical classifiers showed an accuracy of 0.6-0.7. RAG, supplied with the 5-10 most similar texts (based on embeddings + cosine similarity) and their numbers, so that the prompt is formed dynamically every time, returns the numbers for new lines with an accuracy of 0.8-0.9.
I assume that, on top of what a classical classifier does, the model “reads” the texts, and that gives it more understanding of what it’s returning vs. a classical model.
So, my “technique” here is to tune the prompt every time I send it and monitor the accuracy afterwards.
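A minimal sketch of that dynamic few-shot approach: embed the new text, pull the most similar labeled examples by cosine similarity, and build the prompt from them. The `embed` vectors below are hand-made toys standing in for a real embedding model (e.g. text-embedding-ada-002), and the account numbers are invented.

```python
# Sketch of dynamic few-shot classification via embedding retrieval:
# nearest labeled examples by cosine similarity become the prompt shots.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_examples(query_vec, labeled, k=5):
    """labeled: list of (text, label, vector). Top-k by cosine similarity."""
    ranked = sorted(labeled, key=lambda ex: -cosine(query_vec, ex[2]))
    return ranked[:k]

def build_prompt(query_text, examples):
    """Form the few-shot prompt dynamically from the retrieved examples."""
    shots = "\n".join(f"Text: {t}\nAccount: {l}" for t, l, _ in examples)
    return f"{shots}\nText: {query_text}\nAccount:"

# Toy labeled set with pre-computed 2-d "embeddings".
labeled = [("office chairs", "6100", [1.0, 0.0]),
           ("taxi to airport", "7200", [0.0, 1.0]),
           ("desk lamps", "6100", [0.9, 0.1])]
top = nearest_examples([0.95, 0.05], labeled, k=2)
prompt = build_prompt("standing desk", top)
print([label for _, label, _ in top])  # ['6100', '6100']
```

The model then completes the prompt with the account number; monitoring accuracy afterwards, as described above, closes the loop.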


Thanks a lot. Are you using the ada embedding model? I found issues with embeddings as well, where the API does not return the same embedding every time. I switched to encoding_format=float, but that also is not 100% consistent. I wanted to try Cohere and see if that is better than ada.


Yes, but I use it via the Weaviate text2vec-openai transformer (text-embedding-ada-002). In my experience, I have received fairly consistent cosine similarity results.


Thanks @PaulBellow

I will say that RAG is indeed useful, and in fact it’s the key to grounding the model and giving it memory. I’m assuming your task, @joyasree78, is Q&A related, and what you have to realize is that that task is simply a compression problem.

The model will actually do a great job of answering almost any question but it needs to see the text with the answer in its context window. If you’re getting poor answers you’re likely not showing the model text that contains the answer.

I could go on and on about the flaws in current RAG techniques (I’m building a company to address them), but what I’d suggest is to look at the queries where the model came back with a bad answer. Was the answer in the prompt text? The model sometimes misses things; it’s rare, but they’re not perfect.

More often than not you’ll find that you’re simply not showing the model the text that contains the answer, and this is when it has to guess (hallucinate). The model always wants to answer you. That’s both its strength and its weakness, because it’s a general purpose model that’s trying to cover a wide range of scenarios.
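That diagnosis is easy to automate as a first pass: for every failed query, check whether the expected answer text actually appeared in the retrieved context. A minimal sketch (the substring check is crude but surprisingly useful for triage; all names here are illustrative):

```python
# Quick diagnostic for failed RAG queries: was the answer text even in
# the context the model saw? If not, the bug is retrieval, not the model.

def answer_in_context(expected_answer, retrieved_chunks):
    """True if the expected answer occurs in any retrieved chunk
    (case-insensitive substring check, a crude first pass)."""
    needle = expected_answer.lower()
    return any(needle in chunk.lower() for chunk in retrieved_chunks)

def triage(failures):
    """failures: list of (query, expected_answer, retrieved_chunks).
    Split failures into retrieval misses vs. model misses."""
    retrieval_miss, model_miss = [], []
    for q, ans, chunks in failures:
        (model_miss if answer_in_context(ans, chunks) else retrieval_miss).append(q)
    return retrieval_miss, model_miss

fails = [("When was X founded?", "1987", ["X was founded in 1987 in Oslo."]),
         ("Who is the CEO?", "Jane Doe", ["X makes widgets."])]
r_miss, m_miss = triage(fails)
print(r_miss, m_miss)  # ['Who is the CEO?'] ['When was X founded?']
```

Queries landing in the retrieval-miss bucket are the ones worth fixing first; per the point above, that bucket is usually the bigger one.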


Just saw this, so this is a very late reply. IMHO ‘traditional’ RAG is great for answering short questions when the answer is buried in a single chunk.
Where it fails is in providing an ‘integrated view’ of a topic or field from a collection of sources. I believe next-gen Augmented LLMs will build more sophisticated background stores, not relying solely on embedding vectors of original document text segments. My current architecture uses a lattice of document/document-section clusters, with relevant abstractions of information at each node, down to actual individual document-sections. Anyone know of others working on things like this?
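Without claiming to know the poster's actual architecture, the general shape of such a layered store can be sketched: leaves hold raw document sections, and each cluster node holds an abstraction of its children, with queries descending from the root. `summarize()` here is a stand-in for an LLM summarization call, and all names are mine.

```python
# Toy sketch of a layered background store (my reconstruction, not the
# poster's system): cluster nodes hold abstractions of their children,
# leaves hold the raw document sections.

def summarize(texts):
    # Stand-in: a real system would call a model to abstract these texts.
    return " | ".join(t[:20] for t in texts)

def build_lattice(clusters):
    """clusters: dict cluster_name -> list of section texts.
    Returns (root_abstraction, node_abstractions, leaves)."""
    nodes = {name: summarize(secs) for name, secs in clusters.items()}
    root = summarize(list(nodes.values()))
    return root, nodes, clusters

root, nodes, leaves = build_lattice({
    "intro": ["Chapter 1 text ...", "Chapter 2 text ..."],
    "methods": ["Chapter 3 text ..."],
})
# Query time would descend: root -> best-matching node -> its leaf sections.
print(sorted(nodes))  # ['intro', 'methods']
```

The point of the abstraction layers is exactly the "integrated view" problem described above: a broad question can be answered from a node summary even when no single leaf chunk contains the answer.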


Not quite the same thing, but I’m using my scaffolding to create networks of AI guides that call each other as functions, and dividing up large documents so that each guide has only one piece of the context, plus instructions on which guides to call for more information. Early tests focused on extremely simple examples—specialized guides have the “secret word” or “secret number,” and the root guide knows about them and can call on them.

For the rest of this week, I’ll be testing with a relatively short book, broken into chapters, with each “branch” guide having access to just one chapter and the root guide having access to the table of contents, index, and introduction, and a list of keywords as part of the description for each chapter. I’ll whip up a specialist guide that is focused on generating keywords automatically.

The theory here is that any one guide would get overwhelmed by too much context, so having a network of guides that each have limited context, connected by a root guide that knows how to call for support, will outperform a single guide that has the full context, even if it should theoretically fit within its context window.
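A toy sketch of that routing idea, with simple keyword overlap standing in for the root guide's function-calling decision; the chapters, keywords, and guide behavior below are all illustrative, not the poster's scaffolding.

```python
# Sketch of a root guide routing questions to branch guides, each of
# which sees only its own chapter. Keyword overlap stands in for the
# model's actual routing decision.

def route(question, chapter_keywords):
    """Pick the chapter whose keyword list best overlaps the question."""
    words = set(question.lower().split())
    scores = {ch: len(words & set(kws)) for ch, kws in chapter_keywords.items()}
    return max(scores, key=scores.get)

def branch_guide(chapter_text, question):
    # Stand-in for a model call scoped to a single chapter's context.
    return f"[answer from chapter containing: {chapter_text[:30]}]"

chapters = {"ch1": "Dragons live in the mountains.",
            "ch2": "The harbor city trades in silk."}
keywords = {"ch1": ["dragons", "mountains"], "ch2": ["harbor", "silk", "trade"]}

q = "Where do dragons live?"
chosen = route(q, keywords)
print(chosen)  # ch1
answer = branch_guide(chapters[chosen], q)
```

In the real setup described above, `route` would be the root guide's function call and `branch_guide` a separate model instance; the sketch just shows the division of context.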


Here is my take on it. The intention behind RAG is excellent: use an LLM to answer questions about external data it was not trained on. In fact, I implemented a quick toy RAG-based chatbot using LlamaIndex, fed it four PDF documents that were definitely used in training the LLM (gemini-pro), and the chatbot answered five out of seven questions correctly.
It couldn’t answer one question and gave a confident wrong answer (a hallucination) to another. When I increased the chunk size, it correctly answered the first question it had missed. It answered the second question too when it was slightly rephrased; when the question was modified further, it gave a wrong answer. For my Q&A use case, the chatbot was unreliable. I would prefer the chatbot to say it cannot answer instead of giving a wrong answer, and I don’t think I want to put such an unreliable chatbot in production. Though the RAG concept and the intention behind it are quite useful, it is a problem if we can’t make products built on it one hundred percent reliable.


LLMs don’t work deterministically though, so this is likely never going to happen.

What I suspect will happen instead, is users will get used to this behaviour and even expect it so they will learn to continue to query.


That’s an incredibly unrealistic standard, regardless of the tools you’re working with.

I doubt you, yourself, are even 100% reliable; why would you expect a piece of software to be?


To see what’s going on, you’d have to examine the prompt that was retrieved/formed. There are a lot more details you need to get right.

First, I would not advise PDFs as your main data source. These are notorious for being parsed incorrectly. If you have no choice, then you need to pore over the chunks being indexed to ensure they make sense.

Second, did you give the model an out? Like “Your answer will be ‘I don’t know’ if the context provided does not answer the question”.

Third, you’ve already shown chunk size matters. It could be that some of the key chunks are incorrectly fragmented, and you need to do more than rely on a simple chunking algorithm to work for you. Chunks should form at least one “complete thought”.

So start with examining your prompt. Does it look reasonable? Things getting chopped? Is there an out? Then work backwards.

This may require getting your hands dirty, and ultimately not using LlamaIndex, and coding this directly yourself if LlamaIndex doesn’t have the right tuning knobs.
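Putting the second point above into a concrete template: the model is explicitly given an out, so missing context produces "I don't know" instead of a confident guess. The exact wording below is illustrative; any phrasing that clearly scopes the model to the provided context works similarly.

```python
# Sketch of a RAG prompt that gives the model an explicit "out" so it
# declines instead of hallucinating when the context lacks the answer.

SYSTEM = ("Answer using ONLY the context below. "
          "If the context does not contain the answer, reply exactly: I don't know.")

def build_rag_prompt(context_chunks, question):
    """Assemble system instruction, numbered chunks, and the question."""
    context = "\n\n".join(f"[chunk {i + 1}]\n{c}"
                          for i, c in enumerate(context_chunks))
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt(["The warranty period is 24 months."],
                          "How long is the warranty?")
print(prompt.splitlines()[0])
```

Numbering the chunks also makes the "examine your prompt" step above easier: when answers go wrong, you can see at a glance which chunk, if any, should have carried the answer.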


I have been working with fairly traditional RAG implementations for several months now. There has rarely been a case where a question that was correctly formatted was not answered, and never a case of hallucination. But, this doesn’t happen by magic – a lot of work has gone into developing a RAG system that generates consistently good results.

First off, do everything @curt.kennedy suggests.

You can’t just chop up text any old kind of way and then expect intelligent answers. I personally use my own version of Semantic Chunking https://www.youtube.com/watch?v=w_veb816Asg&t=1s&pp=ygUOc2VtYW50aWMgY2h1bms%3D

But, if you do a YouTube search for “Semantic Chunking”, you will find a lot of different tutorials on how to chunk your data in ways that will make it far more useful in your queries.
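The core of semantic chunking, simplified from the tutorials mentioned above, is just: embed consecutive sentences and start a new chunk wherever similarity between neighbors drops. The `embed` below is a toy bag-of-words stand-in over an invented vocabulary; a real pipeline would use an embedding model.

```python
# Simplified semantic chunking: split wherever the cosine similarity
# between consecutive sentence embeddings drops below a threshold.
import math

def embed(sentence):
    # Toy stand-in: bag-of-words vector over a tiny fixed vocabulary.
    vocab = ["cat", "dog", "tax", "invoice"]
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def semantic_chunks(sentences, threshold=0.3):
    """Group consecutive sentences; break where similarity dips."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sents = ["the cat sat", "the cat and dog played",
         "tax is due", "pay the invoice and tax"]
print(semantic_chunks(sents))
# ['the cat sat the cat and dog played', 'tax is due pay the invoice and tax']
```

The threshold is the knob: too high and everything fragments, too low and unrelated ideas land in one chunk, which is exactly the "complete thought" problem discussed earlier.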

Also, when you say it answered 5 out of 7 questions: what were the two questions it did NOT answer? And what makes you say they weren’t answered, or that they could have been answered from the given text? You have to look at both the EXACT phraseology of the question asked and the exact text you expected to answer it. You should probably brush up on your Prompt Engineering 101, but I’ll bet you that either your question wasn’t specific enough (based upon the actual text present), or your chunking split the idea of the text in a way that did not match the cosine similarity search in the way you felt it should. Again, your chunking method.

Finally, was the question specific or general? RAG systems, by definition, are excellent at needle in haystack types of queries – typically who, what, when, where? They begin to suck when it comes to sweeping questions and those which involve how, and to a greater extent, why?

I have found that a lot of people think, because these machines seem so human-like in their responses, that they actually “think” about the data they are pontificating on. They don’t. They are simply regurgitating text. They aren’t human, so they don’t possess human intuition – they can’t reason. So when you ask a vague, sweeping question that is not specifically answered in one or more of the chunks returned to it, it’s going to either hallucinate or tell you it can’t answer the question (if you have written your system prompt correctly).

All this to say that RAG, even simple RAG, is far more than dumping PDFs into a vector store and asking questions with little thought behind how the model is actually capable of responding.


Though my experience with RAG is still limited, I would agree that the pre-processing or “data chunking” step is very important. And it seems time-consuming to verify that each chunk is semantically sound. It would be nice if there were a proven algorithm able to scan a directory, ensure each chunk/file is a single semantic unit, and make corrections if not… but maybe that’s too much to ask…


Theoretically, you could use a model to do this. Right now, I take my semantic chunks and simply chunk them further if they exceed my chunk limit.

However, another approach would be to use a model to examine the chunk and have it determine where the cuts should be made. This, however, would be far more time-consuming and expensive – but would also result in semantically perfect chunks.


This exists to some extent with semantic chunking, which is done using embeddings.


One may have multiple PDFs or Word documents, or many of them.
Relevant libraries can easily extract these documents into plain text files; however, they break off at each page and occasionally split a whole word in half. And semantically, each page does not necessarily form one semantic unit. Hence, some additional data pre-processing work is necessary. And if we have 1000 or more such text files, going over each of them manually would be extremely time-consuming. That’s the rationale for my initial query about an automated and reliable algorithm to address the issue.
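The mechanical part of that pre-processing, rejoining words split by a hyphen at a line or page break and merging page fragments so page boundaries stop masquerading as semantic boundaries, is easy to sketch. This covers only the mechanical repairs; deciding semantic units still needs the chunking approaches discussed above.

```python
# Sketch of mechanical PDF-text cleanup before chunking: rejoin
# hyphen-broken words and merge per-page fragments into one stream.
import re

def rejoin_hyphenation(text):
    """Join 'imple-\\nmentation' back into 'implementation'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def merge_pages(pages):
    """Concatenate extracted pages, fixing hyphen breaks across joins."""
    return rejoin_hyphenation("\n".join(pages))

pages = ["Retrieval is a key com-", "ponent of the system."]
print(merge_pages(pages))  # Retrieval is a key component of the system.
```

One caveat: a naive rejoin can also merge legitimately hyphenated compounds that happen to break at a line end, so spot-checking a sample of the output is still worthwhile.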


Right. So, what do you think of my idea of having an LLM do this?


A special purpose tiny language model could be a solution.

Separate from data pre-processing, there’s an interesting technique to increase semantic similarity matching that I’m willing to share with you and a few others via PM if interested (you’ve probably already experimented with it as well, since it’s a logical progression).

[edit] and it would be more efficient.