Bhaskara Reddy S
Failure Points In RAG Systems
Anyone who has tried to deploy a RAG system knows that there are several failure modes to watch out for.
While RAG helps you reduce hallucinations and build a custom chat LLM, there are several potential failure points, given the complexity of the system.
The paper “Seven Failure Points When Engineering a Retrieval Augmented Generation System” (link below) makes a comprehensive list of places where failure can occur.
Here is the list of failure points they present, with a note on each based on real-world use cases.
Missing Content - This is one of the biggest issues in real-world production cases. A user assumes that the answer to a particular question exists in the knowledge base. It doesn’t, and the system doesn’t respond with “I don’t know.” Instead, it presents a plausible wrong answer, which can be extremely frustrating.
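One common mitigation (my own sketch, not something the paper prescribes) is a refusal threshold: if the retriever’s best score is too low, answer “I don’t know” instead of generating. The function name and the 0.35 cutoff are illustrative assumptions:

```python
# Illustrative sketch: refuse when retrieval confidence is low instead of
# letting the model invent a plausible-sounding answer.
# answer_or_refuse and the 0.35 threshold are made-up examples.

def answer_or_refuse(hits, threshold=0.35):
    """hits: list of (similarity_score, passage), best score first."""
    if not hits or hits[0][0] < threshold:
        return "I don't know - this isn't covered in the knowledge base."
    # Otherwise, hand the passages to the LLM as grounding context.
    context = "\n\n".join(passage for _, passage in hits)
    return f"ANSWER_FROM_CONTEXT:\n{context}"

print(answer_or_refuse([(0.12, "an unrelated passage")]))
```

The right threshold depends heavily on your embedding model, so it should be tuned against real queries rather than hard-coded.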
Missed the Top Ranked Documents - Retrievers are mini-search systems and non-trivial to get right. A simple embedding look-up rarely does the trick. Sometimes, the right answer is not present in the top K documents returned by the retriever, causing failure.
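To make this failure concrete, here is a toy example (all document texts and scores are invented) where the right passage exists in the index but falls outside the embedding top-K, and a cheap keyword rerank over a wider candidate pool recovers it:

```python
# Toy illustration of the "missed the top ranked documents" failure and a
# simple rerank mitigation. embedding_search stands in for a vector store;
# the vec_score values are assumed precomputed similarities.

def embedding_search(query, docs, top_k):
    scored = sorted(docs, key=lambda d: d["vec_score"], reverse=True)
    return scored[:top_k]

def keyword_rerank(query, candidates):
    # Rank a wider candidate pool by query-term overlap.
    terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda d: len(terms & set(d["text"].lower().split())),
                  reverse=True)

docs = [
    {"text": "refund policy for enterprise plans", "vec_score": 0.71},
    {"text": "holiday schedule",                   "vec_score": 0.80},
    {"text": "office locations",                   "vec_score": 0.78},
]
query = "what is the refund policy"
top2 = embedding_search(query, docs, top_k=2)  # the refund doc is missed
wide = embedding_search(query, docs, top_k=3)  # retrieve wider, then rerank
best = keyword_rerank(query, wide)[0]
print(best["text"])
```

In production you would use a trained cross-encoder or hybrid keyword+vector search rather than raw term overlap, but the shape of the fix is the same: retrieve wide, then rerank.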
Not in Context - Sometimes you get too many documents back and have to trim the set you send as context, which means the answer to the question may end up not being in the context at all. This can cause the model to hallucinate unless the system prompt explicitly instructs it not to return answers that are not grounded in the context.
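A minimal sketch of context trimming under a budget, keeping the highest-scored passages first, plus a grounding instruction. The function name, the character budget, and the prompt wording are all my own assumptions:

```python
# Illustrative sketch: pack retrieved passages into a fixed context budget,
# best score first, and tell the model to stay grounded in that context.
# build_context, the budget, and SYSTEM_PROMPT are assumptions for this demo.

def build_context(passages, budget_chars=60):
    ordered = sorted(passages, key=lambda p: p["score"], reverse=True)
    picked, used = [], 0
    for p in ordered:
        if used + len(p["text"]) > budget_chars:
            continue  # skip what doesn't fit rather than cutting mid-passage
        picked.append(p["text"])
        used += len(p["text"])
    return picked

SYSTEM_PROMPT = ("Answer ONLY from the provided context. If the answer is "
                 "not in the context, say you don't know.")

ctx = build_context([
    {"text": "passage A" * 5, "score": 0.9},  # 45 chars, kept first
    {"text": "passage B",     "score": 0.8},  # 9 chars, still fits
    {"text": "passage C",     "score": 0.7},  # 9 chars, over budget, dropped
])
print(ctx)
```

Real systems budget in tokens rather than characters, but the trade-off is identical: whatever gets dropped here is an answer the model can never give.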
Not Extracted - This happens when the LLM fails to extract the answer from the context even though it is present. It tends to be an issue when you stuff the context and the LLM gets confused. Different LLMs also have different levels of context understanding.
Wrong Format - While the paper lists this as a failure mode, this type of functionality doesn’t really come out of the box with LLMs. You have to do a lot of system prompting and write code to generate the information in a specific format. If this is an important feature, it will require software development and testing. For example, using Abacus AI, you can create an agent to output code in a certain format and generate Word docs with tables, paragraphs, bold text, etc.
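As a sketch of the extra engineering this implies, you can validate the model’s output against the format you need and retry instead of trusting it. Here `call_llm` is a hypothetical stand-in for whatever model client you use, and the retry count is arbitrary:

```python
# Hedged sketch: enforce a JSON output format by validating and retrying.
# call_llm is a placeholder for a real model client, not an actual API.
import json

def get_structured_answer(call_llm, prompt, retries=2):
    for attempt in range(retries + 1):
        suffix = "" if attempt == 0 else "\nReturn ONLY valid JSON."
        raw = call_llm(prompt + suffix)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # reprompt with a stricter instruction and try again
    raise ValueError("model never produced valid JSON")

# Fake model that fails once, then complies, to exercise the retry path.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return "Sure! Here it is." if len(calls) == 1 else '{"title": "Report"}'

print(get_structured_answer(fake_llm, "Summarize as JSON"))
```

Validate-and-retry is the simplest option; constrained decoding or a provider’s structured-output mode, where available, removes the retry loop entirely.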
Incorrect Specificity - This happens when the response is not at the right level of specificity: the answer may be too vague or too high-level. Typically, a follow-up question solves this problem. The other reason this failure can happen is if the answer is in a different
Overall, this means that RAG systems must be tested thoroughly for robustness before going into production; you can easily shoot yourself in the foot by releasing an agent or chatbot that hasn’t gone through beta testing.
Read the paper: https://arxiv.org/abs/2401.05856
In prompt engineering, it’s about seeing what works and adjusting the prompt so it is more conducive to the query, honestly making it easier for the model to find the right answer.
I don’t think I can bear this anymore. Over the past month or so, GPT has been driven crazy by the knowledge files it is forced to use, which has ruined the whole process of working with knowledge. There are also problems with how it determines priority from stored knowledge and with how that knowledge is enforced. I recently had an email conversation with the team about the problem: the system clings to the knowledge files so tightly that it creates usage problems. So I chose to find my own workaround. I removed the knowledge files, put the text on a web page, and added links to that page as knowledge. Many of the problems disappeared. Beyond the knowledge issue, I hope someone will finally do some research on this so we can stop going crazy.
But in the end, OpenAI blocked the use of links. What’s infuriating is that they know the problem, and they know I have a solution, but they turned it off anyway. How infuriating.