RAG is not really a solution


After spending more than a year now with Gen AI, I feel RAG is more of a problem than a solution. It is so brittle, and there is no science to it. I wanted to check with this group whether anyone is aware of other techniques or ways to work more efficiently in this space. One thing I was expecting to see here is something similar to automated hyperparameter tuning in traditional ML. The idea: if I provide a set of retrieval techniques and a loss function, is there a product or process that will run the different retrieval techniques, calculate the loss, and automatically identify the best technique? I could not find any such product so far.
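For what it's worth, the loop being asked for is simple to sketch even if no off-the-shelf product does it: given candidate retrievers and a labeled eval set, score each and keep the winner. This is a minimal toy sketch, not a real product; the retrievers and the recall@k metric here are stand-ins you would swap for BM25, embeddings, hybrid search, etc.

```python
# Toy sketch of automated retriever selection, analogous to hyperparameter
# tuning: score each candidate retrieval function against a labeled eval
# set and keep the best. Retrievers and metric are stand-ins.

def recall_at_k(retrieved, relevant, k=3):
    """Fraction of relevant chunk ids found in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / max(len(relevant), 1)

def evaluate(retriever, eval_set, k=3):
    """Average recall@k of one retriever over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retriever(q), rel, k) for q, rel in eval_set]
    return sum(scores) / len(scores)

def select_best(retrievers, eval_set, k=3):
    """Run every candidate and return (name, score) of the winner."""
    results = {name: evaluate(fn, eval_set, k) for name, fn in retrievers.items()}
    best = max(results, key=results.get)
    return best, results[best]

# Demo with two fake retrievers over a 3-query eval set.
eval_set = [("q1", ["a"]), ("q2", ["b"]), ("q3", ["c"])]
good = lambda q: {"q1": ["a", "x"], "q2": ["b"], "q3": ["c"]}[q]
bad = lambda q: ["x", "y", "z"]
best, score = select_best({"good": good, "bad": bad}, eval_set)
print(best, score)  # good 1.0
```

The "loss function" slot here is just a retrieval metric; in practice you would likely evaluate end-to-end answer quality instead, which is much harder to score automatically.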


@stevenic has done quite a bit, but he’s busy with a startup at the moment, I believe. If you search the forum, we’ve got a lot of great hidden gems / wisdom.

Hrm… I wonder if someone would want to create a massive “here’s all the best RAG posts” for the forum…


After a year of working on my RAG system, I could not disagree more heartily.
Here are a couple of information sources I have found useful:

No idea what you’re talking about. But if that is your use requirement, I can certainly see your concern.

I just want to make the point that I have found RAG to be extremely useful in my use case: Creating keyword and semantic searchable knowledgebases consisting of thousands of documents and focused on very specific areas of interest.

Biggest problem I’ve run into so far: some query responses are not comprehensive enough. End-users can almost always get to a complete answer using chain-of-thought queries (few-shot). But the end-users I’ve been working with want complete answers on the first question (zero-shot). This may touch on your issue.

My resolution: Deep Dive. Have the model dig through all the possible responses, categorize and analyze those, then return a complete list of best responses. Since I built my RAG system myself, I also have to build this feature. So I’m thinking: whatever this technique is that you’re missing, you may have to build it yourself.
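The "Deep Dive" steps described above might look roughly like this; to be clear, this is my own guess at the shape of it, not the poster's actual implementation, and the score floor and topic labels are invented for the demo. Instead of answering from the single top chunk, gather every candidate above a relevance floor, group them, and hand the grouped set to the model for a synthesized answer.

```python
# Rough sketch of a "Deep Dive" pass (my reconstruction, not the poster's
# code): keep every candidate chunk above a score floor, bucket by topic
# best-first, then pass the buckets to the model for a complete answer.

def deep_dive(candidates, score_floor=0.75):
    """candidates: list of (chunk_text, topic, score).
    Returns buckets of sufficiently relevant chunks, grouped by topic."""
    kept = [c for c in candidates if c[2] >= score_floor]
    buckets = {}
    for text, topic, score in sorted(kept, key=lambda c: -c[2]):
        buckets.setdefault(topic, []).append(text)
    return buckets

cands = [("A1", "pricing", 0.90), ("B1", "setup", 0.80),
         ("A2", "pricing", 0.77), ("C1", "misc", 0.30)]
groups = deep_dive(cands)
print(groups)  # {'pricing': ['A1', 'A2'], 'setup': ['B1']}
```

The final step, asking the model to analyze each bucket and merge the results, is where the real cost lives; this sketch only covers the gathering and grouping.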

I have felt this way. A lot. But the truth is, so much of it is trial and error and fine-tuning the techniques that work best for your use case. And perhaps, in the end, that is the science of it all.


I find RAG extremely useful for giving the model more context. My use case is using an LLM as a classifier on text data, one that doesn’t just use ML techniques but also “knows” something about what these texts are about. E.g., I have a dataset of texts with associated numbers (encoded expense accounts, for example). So, I receive a new text and I need to find the number for it. Classical classifiers showed an accuracy of 0.6-0.7. RAG, supplied with the 5-10 most similar texts (based on embeddings + cosine similarity) and their numbers, so that the prompt is formed dynamically every time, returns the numbers for new lines with an accuracy of 0.8-0.9.
I assume that, on top of what a classical classifier does, the model “reads” the texts, and that gives it more understanding of what it’s returning vs. a classical model.
So, my “technique” here is to tune the prompt every time I send it and monitor the accuracy afterwards.
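A minimal sketch of that dynamic few-shot approach: embed the new text, pull the most similar labeled examples by cosine similarity, and build the prompt from them. The `embed` vectors below are hand-made toys standing in for a real embedding model (e.g. text-embedding-ada-002), and the account numbers are invented.

```python
# Sketch of dynamic few-shot classification via embedding retrieval:
# nearest labeled examples by cosine similarity become the prompt shots.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_examples(query_vec, labeled, k=5):
    """labeled: list of (text, label, vector). Top-k by cosine similarity."""
    ranked = sorted(labeled, key=lambda ex: -cosine(query_vec, ex[2]))
    return ranked[:k]

def build_prompt(query_text, examples):
    """Form the few-shot prompt dynamically from the retrieved examples."""
    shots = "\n".join(f"Text: {t}\nAccount: {l}" for t, l, _ in examples)
    return f"{shots}\nText: {query_text}\nAccount:"

# Toy labeled set with pre-computed 2-d "embeddings".
labeled = [("office chairs", "6100", [1.0, 0.0]),
           ("taxi to airport", "7200", [0.0, 1.0]),
           ("desk lamps", "6100", [0.9, 0.1])]
top = nearest_examples([0.95, 0.05], labeled, k=2)
prompt = build_prompt("standing desk", top)
print([label for _, label, _ in top])  # ['6100', '6100']
```

The model then completes the prompt with the account number; monitoring accuracy afterwards, as described above, closes the loop.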


Thanks a lot. Are you using the ada embedding model? I found issues with embeddings as well, where the API does not return the same embedding every time. I switched to encoding_format=float, but that also is not 100% consistent. I wanted to try Cohere and see if that is better than ada.


Yes, but I use it via the Weaviate text2vec-openai transformer (text-embedding-ada-002). In my experience, I have received fairly consistent cosine similarity results.


Thanks @PaulBellow

I will say that RAG is indeed useful, and in fact it’s the key to grounding the model and giving it memory. I’m assuming your task, @joyasree78, is Q&A related, and what you have to realize is that that task is simply a compression problem.

The model will actually do a great job of answering almost any question but it needs to see the text with the answer in its context window. If you’re getting poor answers you’re likely not showing the model text that contains the answer.

I could go on and on about the flaws in current RAG techniques (I’m building a company to address them), but what I’d suggest is to look at the queries where the model came back with a bad answer. Was the answer in the prompt text? The model sometimes misses things; it’s rare, but they’re not perfect.

More often than not you’ll find that you’re simply not showing the model the text that contains the answer, and this is when it has to guess (hallucinate). The model always wants to answer you. That’s both its strength and its weakness, because it’s a general purpose model that’s trying to cover a wide range of scenarios.
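That diagnosis is easy to automate as a first pass: for every failed query, check whether the expected answer text actually appeared in the retrieved context. A minimal sketch (the substring check is crude but surprisingly useful for triage; all names here are illustrative):

```python
# Quick diagnostic for failed RAG queries: was the answer text even in
# the context the model saw? If not, the bug is retrieval, not the model.

def answer_in_context(expected_answer, retrieved_chunks):
    """True if the expected answer occurs in any retrieved chunk
    (case-insensitive substring check, a crude first pass)."""
    needle = expected_answer.lower()
    return any(needle in chunk.lower() for chunk in retrieved_chunks)

def triage(failures):
    """failures: list of (query, expected_answer, retrieved_chunks).
    Split failures into retrieval misses vs. model misses."""
    retrieval_miss, model_miss = [], []
    for q, ans, chunks in failures:
        (model_miss if answer_in_context(ans, chunks) else retrieval_miss).append(q)
    return retrieval_miss, model_miss

fails = [("When was X founded?", "1987", ["X was founded in 1987 in Oslo."]),
         ("Who is the CEO?", "Jane Doe", ["X makes widgets."])]
r_miss, m_miss = triage(fails)
print(r_miss, m_miss)  # ['Who is the CEO?'] ['When was X founded?']
```

Queries landing in the retrieval-miss bucket are the ones worth fixing first; per the point above, that bucket is usually the bigger one.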


Just saw this, so this is a very late reply. IMHO ‘traditional’ RAG is great for answering short questions when the answer is buried in a single chunk.
Where it fails is in providing an ‘integrated view’ of a topic or field from a collection of sources. I believe next-gen Augmented LLMs will build more sophisticated background stores, not relying solely on embedding vectors of original document text segments. My current architecture uses a lattice of document/document-section clusters, with relevant abstractions of information at each node, down to actual individual document-sections. Anyone know of others working on things like this?
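Without claiming to know the poster's actual architecture, the general shape of such a layered store can be sketched: leaves hold raw document sections, and each cluster node holds an abstraction of its children, with queries descending from the root. `summarize()` here is a stand-in for an LLM summarization call, and all names are mine.

```python
# Toy sketch of a layered background store (my reconstruction, not the
# poster's system): cluster nodes hold abstractions of their children,
# leaves hold the raw document sections.

def summarize(texts):
    # Stand-in: a real system would call a model to abstract these texts.
    return " | ".join(t[:20] for t in texts)

def build_lattice(clusters):
    """clusters: dict cluster_name -> list of section texts.
    Returns (root_abstraction, node_abstractions, leaves)."""
    nodes = {name: summarize(secs) for name, secs in clusters.items()}
    root = summarize(list(nodes.values()))
    return root, nodes, clusters

root, nodes, leaves = build_lattice({
    "intro": ["Chapter 1 text ...", "Chapter 2 text ..."],
    "methods": ["Chapter 3 text ..."],
})
# Query time would descend: root -> best-matching node -> its leaf sections.
print(sorted(nodes))  # ['intro', 'methods']
```

The point of the abstraction layers is exactly the "integrated view" problem described above: a broad question can be answered from a node summary even when no single leaf chunk contains the answer.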


Not quite the same thing, but I’m using my scaffolding to create networks of AI guides that call each other as functions, and dividing up large documents so that each guide has only one piece of the context, plus instructions on which guides to call for more information. Early tests focused on extremely simple examples—specialized guides have the “secret word” or “secret number,” and the root guide knows about them and can call on them.

For the rest of this week, I’ll be testing with a relatively short book, broken into chapters, with each “branch” guide having access to just one chapter and the root guide having access to the table of contents, index, and introduction, and a list of keywords as part of the description for each chapter. I’ll whip up a specialist guide that is focused on generating keywords automatically.

The theory here is that any one guide would get overwhelmed by too much context, so having a network of guides that each have limited context, connected by a root guide that knows how to call for support, will outperform a single guide that has the full context, even if it should theoretically fit within its context window.
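A toy sketch of that routing idea, with simple keyword overlap standing in for the root guide's function-calling decision; the chapters, keywords, and guide behavior below are all illustrative, not the poster's scaffolding.

```python
# Sketch of a root guide routing questions to branch guides, each of
# which sees only its own chapter. Keyword overlap stands in for the
# model's actual routing decision.

def route(question, chapter_keywords):
    """Pick the chapter whose keyword list best overlaps the question."""
    words = set(question.lower().split())
    scores = {ch: len(words & set(kws)) for ch, kws in chapter_keywords.items()}
    return max(scores, key=scores.get)

def branch_guide(chapter_text, question):
    # Stand-in for a model call scoped to a single chapter's context.
    return f"[answer from chapter containing: {chapter_text[:30]}]"

chapters = {"ch1": "Dragons live in the mountains.",
            "ch2": "The harbor city trades in silk."}
keywords = {"ch1": ["dragons", "mountains"], "ch2": ["harbor", "silk", "trade"]}

q = "Where do dragons live?"
chosen = route(q, keywords)
print(chosen)  # ch1
answer = branch_guide(chapters[chosen], q)
```

In the real setup described above, `route` would be the root guide's function call and `branch_guide` a separate model instance; the sketch just shows the division of context.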


Here is my take on it. The intention behind RAG is excellent: use an LLM to answer questions about external data it was not trained on. In fact, I implemented a quick toy RAG-based chatbot using LlamaIndex, fed it four PDF documents that were definitely used in training the LLM (gemini-pro), and the chatbot answered five out of seven questions correctly.
It couldn’t answer one question and gave a confident wrong answer (a hallucination) to another. When I increased the chunk size, it correctly answered the first question it had missed. It answered the second question too when it was slightly rephrased; when the question was modified further, it gave a wrong answer. For my Q&A use case, the chatbot was unreliable. I would prefer the chatbot to say it cannot answer instead of giving a wrong answer, and I don’t think I want to put such an unreliable chatbot in production. Though the RAG concept and the intention behind it are quite useful, it is a problem if we can’t make products built on it one hundred percent reliable.


LLMs don’t work deterministically though, so this is likely never going to happen.

What I suspect will happen instead, is users will get used to this behaviour and even expect it so they will learn to continue to query.


That’s an incredibly unrealistic standard, regardless of the tools you’re working with.

I doubt you, yourself, are even 100% reliable; why would you expect a piece of software to be?


To see what’s going on, you’d have to examine the prompt that was retrieved/formed. There are a lot more details you need to get right.

First, I would not advise PDFs as your main data source. These are notorious for being parsed incorrectly. If you have no choice, then you need to pore over the chunks being indexed to ensure they make sense.

Second, did you give the model an out? Like “Your answer will be ‘I don’t know’ if the context provided does not answer the question”.

Third, you’ve already shown chunk size matters. It could be that some of the key chunks are incorrectly fragmented, and you need to do more than rely on a simple chunking algorithm to work for you. Chunks should form at least one “complete thought”.

So start with examining your prompt. Does it look reasonable? Things getting chopped? Is there an out? Then work backwards.

This may require getting your hands dirty, and ultimately not using LlamaIndex, and coding this directly yourself if LlamaIndex doesn’t have the right tuning knobs.
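Putting the second point above into a concrete template: the model is explicitly given an out, so missing context produces "I don't know" instead of a confident guess. The exact wording below is illustrative; any phrasing that clearly scopes the model to the provided context works similarly.

```python
# Sketch of a RAG prompt that gives the model an explicit "out" so it
# declines instead of hallucinating when the context lacks the answer.

SYSTEM = ("Answer using ONLY the context below. "
          "If the context does not contain the answer, reply exactly: I don't know.")

def build_rag_prompt(context_chunks, question):
    """Assemble system instruction, numbered chunks, and the question."""
    context = "\n\n".join(f"[chunk {i + 1}]\n{c}"
                          for i, c in enumerate(context_chunks))
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt(["The warranty period is 24 months."],
                          "How long is the warranty?")
print(prompt.splitlines()[0])
```

Numbering the chunks also makes the "examine your prompt" step above easier: when answers go wrong, you can see at a glance which chunk, if any, should have carried the answer.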


I have been working with fairly traditional RAG implementations for several months now. There has rarely been a case where a question that was correctly formatted was not answered, and never a case of hallucination. But, this doesn’t happen by magic – a lot of work has gone into developing a RAG system that generates consistently good results.

First off, do everything @curt.kennedy suggests.

You can’t just chop up text any old kind of way and then expect intelligent answers. I personally use my own version of Semantic Chunking https://www.youtube.com/watch?v=w_veb816Asg&t=1s&pp=ygUOc2VtYW50aWMgY2h1bms%3D

But, if you do a YouTube search for “Semantic Chunking”, you will find a lot of different tutorials on how to chunk your data in ways that will make it far more useful in your queries.
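The core of semantic chunking, simplified from the tutorials mentioned above, is just: embed consecutive sentences and start a new chunk wherever similarity between neighbors drops. The `embed` below is a toy bag-of-words stand-in over an invented vocabulary; a real pipeline would use an embedding model.

```python
# Simplified semantic chunking: split wherever the cosine similarity
# between consecutive sentence embeddings drops below a threshold.
import math

def embed(sentence):
    # Toy stand-in: bag-of-words vector over a tiny fixed vocabulary.
    vocab = ["cat", "dog", "tax", "invoice"]
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def semantic_chunks(sentences, threshold=0.3):
    """Group consecutive sentences; break where similarity dips."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sents = ["the cat sat", "the cat and dog played",
         "tax is due", "pay the invoice and tax"]
print(semantic_chunks(sents))
# ['the cat sat the cat and dog played', 'tax is due pay the invoice and tax']
```

The threshold is the knob: too high and everything fragments, too low and unrelated ideas land in one chunk, which is exactly the "complete thought" problem discussed earlier.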

Also, when you say it answered 5 out of 7 questions: what were the two questions it did NOT answer? And what makes you say they weren’t answered, or that they could have been answered from the given text? You have to look at both the EXACT phraseology of the question asked and the exact text you expected to answer it. You should probably brush up on your Prompt Engineering 101, but I’ll bet you that either your question wasn’t specific enough (based upon the actual text present), or your chunking split the idea of the text in a way that did not match the cosine similarity search in the way you felt it should. Again, your chunking method.

Finally, was the question specific or general? RAG systems, by definition, are excellent at needle in haystack types of queries – typically who, what, when, where? They begin to suck when it comes to sweeping questions and those which involve how, and to a greater extent, why?

I have found that a lot of people think, because these machines seem so human-like in their responses, that they actually “think” about the data they are pontificating on. They don’t. They are simply regurgitating text. They aren’t human, so they don’t possess human intuition – they can’t reason. So when you ask a vague, sweeping question that is not specifically answered in one or more of the chunks returned to it, it’s going to either hallucinate or tell you it can’t answer the question (if you have written your system prompt correctly).

All this to say that RAG, even simple RAG, is far more than dumping PDFs into a vector store and asking questions with little thought behind how the model is actually capable of responding.


Though my experience with RAG is still limited, I would agree that the pre-processing or “data chunking” step is very important. And it seems time-consuming to verify that each chunk is semantically sound. It would be nice if there were a proven algorithm able to scan a directory, ensure each chunk/file is a single semantic unit, and make corrections if not… but maybe that’s too much to ask…


Theoretically, you could use a model to do this. Right now, I take my semantic chunks and simply chunk them further if they exceed my chunk limit.

However, another approach would be to use a model to examine the chunk and have it determine where the cuts should be made. This, however, would be far more time-consuming and expensive – but would also result in semantically perfect chunks.


This exists to some extent with semantic chunking, which is done using embeddings.


One may have multiple PDFs or Word documents, or many of them.
Relevant libraries can easily extract these documents into plain text files; however, they break off at each page and occasionally split a whole word in half. And semantically, each page does not necessarily form one semantic unit. Hence, some additional data pre-processing work is necessary. And if we have 1000 or more such text files, going over each of them manually would be extremely time-consuming. That’s the rationale for my initial query about an automated and reliable algorithm to address the issue.
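The mechanical part of that pre-processing, rejoining words split by a hyphen at a line or page break and merging page fragments so page boundaries stop masquerading as semantic boundaries, is easy to sketch. This covers only the mechanical repairs; deciding semantic units still needs the chunking approaches discussed above.

```python
# Sketch of mechanical PDF-text cleanup before chunking: rejoin
# hyphen-broken words and merge per-page fragments into one stream.
import re

def rejoin_hyphenation(text):
    """Join 'imple-\\nmentation' back into 'implementation'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def merge_pages(pages):
    """Concatenate extracted pages, fixing hyphen breaks across joins."""
    return rejoin_hyphenation("\n".join(pages))

pages = ["Retrieval is a key com-", "ponent of the system."]
print(merge_pages(pages))  # Retrieval is a key component of the system.
```

One caveat: a naive rejoin can also merge legitimately hyphenated compounds that happen to break at a line end, so spot-checking a sample of the output is still worthwhile.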


Right. So, what do you think of my idea of having an LLM do this?


A special purpose tiny language model could be a solution.

Separate from data pre-processing, there’s an interesting technique to increase semantic similarity matching that I’m willing to share with you and a few others via PM if interested (you’ve probably already experimented with it as well, since it’s a logical progression).

[edit] and it would be more efficient.