How is everyone thinking about RAG Evolution in this new era of Reasoning Models?
Up until this point, my working hypothesis was that RAG got us to roughly 80% accuracy with a basic non-reasoning LLM. However, as of this week, I am back to the drawing board, re-evaluating these techniques with the onset of reasoning models. If your knowledge-bot use cases are basically just processing public data, wouldn't a powerful reasoning model simply prompt its way through it?
I think the main problem with RAG is that an embedding model can catch semantically close parts but will not catch the subtle nuances.
That’s why many people love gpt-4.5-preview: it’s not a cheap vector search, but a very complex model that has a lot of parameters and requires lots of expensive GPUs.
Who knows what the future is? Things are changing very fast…
I don’t think it is possible to put a percentage on the accuracy of RAG in general, because the accuracy depends so much on the domain, the coverage of your corpus, and the accuracy of the documents in your corpus. E.g., if you want a system that takes the point of view that climate change is a Chinese conspiracy, you could create a corpus from a whole bunch of web pages and articles from that side of the political spectrum, and you would get answers that were “accurate” from that point of view but not from the viewpoint of science and the real world.
The accuracy is based on many factors. If the model requests the data using the right keywords, it will retrieve the most suitable subset from your data store to evaluate.
But the retrieval process does not use natural language. Once the pool of records is retrieved from RAG, only then will the model search the resulting subset using natural language and its powerful comprehension skills.
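Roughly, that two-stage flow looks like this. A minimal sketch in Python; the corpus, the keyword scoring, and the prompt wording are placeholders for illustration, not any vendor’s actual API:

```python
# Minimal sketch of the retrieve-then-read pattern described above.
# The corpus, scoring, and final LLM call are hypothetical placeholders.

def keyword_retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Stage 1: cheap keyword-overlap scoring; no natural-language understanding."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q_terms & set(doc.lower().split())))
    return scored[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Stage 2: only the retrieved subset is handed to the LLM for comprehension."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = [
    "Blackwater tanks store sewage waste on recreational vehicles.",
    "Greywater systems collect sink and shower runoff.",
    "Fresh water tanks must be sanitized annually.",
]
query = "blackwater tank regulations"
prompt = build_prompt(query, keyword_retrieve(query, corpus))
# prompt would then go to whatever LLM you use, e.g. answer = call_llm(prompt)
print(prompt)
```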
The weak link is the RAG retrieval process. Allowing the LLM to scan your entire data store is too expensive, so you’re at the mercy of the retrieval model. Google uses a tuned model called Gecko, which is disgustingly fast but often leaves pertinent records out of the selection pool because the keywords and semantics aren’t recognized as matches. Still, it’s about 50 times faster than OpenAI’s RAG.
It’s an imperfect system. You need to determine your priority: SPEED or ACCURACY. I went with speed, because neither is that accurate. But it’s still far better than letting the LLM fabricate nonsense out of thin air, which it loves to do.
I understand and agree that putting an arbitrary number on accuracy is not the right way to frame it, but we have to start at a reasonable baseline. In certain domains like state and local government, accuracy is key, and the challenge is customers’ evolving expectation that LLMs should always be accurate for factual, domain-specific answers. Evals for these individual entities are an additional challenge. A year ago the theory was perhaps to not default to an LLM; however, now that LLMs have introduced reasoning, is the default shifting towards LLMs?
Yes, the default semantic search offered by many vector databases is proving to be really basic. You are correct that it doesn’t apply NLP to question matching. For example, “What vegan foods do you recommend?” vs. “What are some vegan diet recommendations?” produce different embeddings and retrieve different documents. So the intermediate technique is effectively prompt engineering on the user prompt. If you don’t mind sharing, how do you optimize for speed on the retrieval part? On the LLM side, if we use streaming we lose the Structured Outputs capability our application needs. On the retrieval side, adjusting “k” is really our only lever, and it has minimal impact on answer outputs.
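To see that mismatch concretely, you can embed both phrasings and compare their document rankings. A minimal sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (the documents are made up; any embedding model shows the effect to some degree):

```python
# Sketch: two paraphrases of the same intent can rank documents differently.
# Assumes `pip install sentence-transformers`; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Our top vegan recipes: lentil curry, chickpea salad, tofu stir-fry.",
    "Dietitian-approved plant-based meal plans for beginners.",
    "Guide to vegan protein sources and daily nutrition targets.",
]
queries = [
    "What vegan foods do you recommend?",
    "What are some vegan diet recommendations?",
]

doc_emb = model.encode(docs, convert_to_tensor=True)
for q in queries:
    q_emb = model.encode(q, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]            # cosine similarity per document
    ranking = scores.argsort(descending=True).tolist()  # top-k order for this phrasing
    print(q, "->", ranking, [round(scores[i].item(), 3) for i in ranking])
```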
Reasoning models don’t have any special magic. They’re basically running a “council” of normal generative models, building a context that is more friendly to the generative nature of LLMs.
The less your RAG depends on a model, the better it will work. The model is just the interface.
To optimize for speed, you can try Google Vertex AI. I set up a collection of 10,000 JSON objects in an OpenAI vector store. Each one-sentence request became a 17,000-token request and took 40 seconds to complete.
I uploaded the same two JSON files to Google Cloud Storage, set up a Vertex AI Search data source, and let it index them. Google doesn’t charge tokens for the LLM request; it charges a few cents for storage. The searches take about 0.5 to 1.7 seconds on average.
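For reference, querying such a data store directly looks roughly like this, assuming the google-cloud-discoveryengine Python client; the project, location, and data store IDs are placeholders, and field names may differ across client versions:

```python
# Sketch: querying a Vertex AI Search data store directly.
# Assumes `pip install google-cloud-discoveryengine`; all IDs are placeholders.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.SearchServiceClient()
serving_config = (
    "projects/YOUR_PROJECT/locations/global/collections/default_collection/"
    "dataStores/YOUR_DATA_STORE/servingConfigs/default_search"
)

request = discoveryengine.SearchRequest(
    serving_config=serving_config,
    query="blackwater tank regulations",
    page_size=10,  # roughly analogous to "k" in a vector-store retrieval
)
for result in client.search(request):
    print(result.document.id)
```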
We’re still in testing, but our OpenAI setup hits about 70-80% accuracy. There are still a few records it skips, despite the LLM supposedly making the selections at the RAG level.
With Google, we’re seeing about 5% more “omissions” of records that should have been considered, due to the selection method. If you type in “blackwater tank,” the chat model knows this is a waste device and could locate regulations regarding sewage waste tanks, whereas the RAG model isn’t designed for natural-language comprehension and will miss some potential matches because of the nomenclature in the vector storage data.
But for us, the speed far outweighs the 5-10% omission rate.
The more you work with “AI,” the more you realize just how unsuitable it is for real-world enterprise solutions. Between the speed, accuracy, and reliability (hallucinations, etc.), it’s more than just a frustration… it’s a liability.
Closing this discussion for now with these points, based on my experiments and research:
- The future of RAG as a standalone technique is debatable, given the complexity of continuous chunking, embedding, and retrieval; reasoning models do challenge the notion of RAG over publicly accessible data (e.g., investor reports of publicly traded companies). Information retrieval/search with RAG is brittle.
- RAG is still effective (cost, relevancy/accuracy, hallucination reduction) for use cases that require the same proprietary knowledge base to be indexed across all departments or functions of an organization. This is, again, for the time being. For reference, this is a good in-depth discussion on RAG → #037 Chunking for RAG: Stop Br… - How AI Is Built - Apple Podcasts
We’re in agreement, and not just on RAG but on LLMs in general. My instruction sets are 10x longer than my prompts and responses. Every “AI” API returns tons of data in the response that 99% of people are not interested in. Likely because it means TOKENS, which EQUAL $$$.
OpenAI, for instance, returns 800 delta events for a few sentences of text. This is USELESS information; only the final response is useful. They could easily add an input parameter to subscribe to the events (in an array) that you want returned. But no thought went into any of it: not RAG, not responses, almost nothing. It was a rush to market for a product that can only be used by 1% of enterprise organizations.
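To make the point concrete, handling those delta events looks roughly like this with the OpenAI Python SDK (the model name and prompt are just examples); the application only ever wants the final concatenated string:

```python
# Sketch: accumulating streaming delta events into the single final answer.
# Assumes the openai Python SDK (v1+); the model name is illustrative.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this regulation in two sentences."}],
    stream=True,
)

parts = []
for chunk in stream:                      # hundreds of small delta events arrive here
    if chunk.choices and chunk.choices[0].delta.content:
        parts.append(chunk.choices[0].delta.content)

final_text = "".join(parts)               # the only thing the application actually needs
print(final_text)
```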
Google has done a nice job with their bastardized version of RAG. I don’t want to request records from RAG and then send them to the LLM. Google integrated this so that you make a request to the LLM, and it generates keywords and parameters which it passes to Gecko (the RAG model), which doesn’t have natural-language search capabilities but is ridiculously fast at retrieving data based on keywords. So both models are leveraged: the LLM generates the widest range of keywords and parameters, and the RAG model retrieves the records.
Then Gecko passes the records back to Gemini to perform whatever tasks you instruct it to do.
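My understanding of that integration, sketched with the vertexai Python SDK’s grounding tool; the data store path and model name are placeholders, and the grounding classes may live under vertexai.preview.generative_models depending on the SDK version:

```python
# Sketch: letting Gemini ground its answer on a Vertex AI Search data store,
# so retrieval and generation happen in a single request.
# Assumes the google-cloud-aiplatform / vertexai SDK; all IDs are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel, Tool, grounding

vertexai.init(project="YOUR_PROJECT", location="us-central1")

datastore_path = (
    "projects/YOUR_PROJECT/locations/global/collections/default_collection/"
    "dataStores/YOUR_CFR_DATA_STORE"
)
retrieval_tool = Tool.from_retrieval(
    grounding.Retrieval(grounding.VertexAISearch(datastore=datastore_path))
)

model = GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "Cite the CFR section governing blackwater tank discharge.",
    tools=[retrieval_tool],
)
print(response.text)
```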
I have 40 instances of “Request Engines” which send 200-300 requests containing chunks (arrays of JSON objects). Each object contains some IDs for reference, to update information in our database, and some text. The objects are batched into chunks of about 1,000 characters: the system appends serialized JSON objects until the length exceeds 1,000 characters. Each item in the array is then processed by Gemini.
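That batching step could be sketched like this (the record fields are made up for illustration, not the real schema):

```python
# Sketch: pack serialized JSON objects into ~1,000-character chunks,
# appending objects until the running length exceeds the budget.
import json

def batch_records(records: list[dict], budget: int = 1000) -> list[list[dict]]:
    chunks, current, length = [], [], 0
    for rec in records:
        current.append(rec)
        length += len(json.dumps(rec))
        if length > budget:          # close the chunk once the budget is exceeded
            chunks.append(current)
            current, length = [], 0
    if current:
        chunks.append(current)
    return chunks

records = [{"id": i, "text": f"Observation {i}: tank inspection note."} for i in range(50)]
chunks = batch_records(records)
print(len(chunks), "chunks;", [len(c) for c in chunks])
```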
Gemini rewrites the provided text, and if the text contains an observation that is in violation of the Code of Federal Regulations, it cites the regulation, VERBATIM, with the code and relevant section.
Gemini’s own context was completely unreliable, so we uploaded the 10,000 CFRs as two JSON files, loaded them into Cloud Storage, and then created an indexed data store. If Gemini determines it needs CFR information, it refers to our data and not its context.
40 instances make 40 requests asynchronously, each containing up to 1,000 characters of JSON objects in an array. As each engine receives a response, it updates the database for each item in the response array and then shoots out another request, until all 200-300 requests are done processing. The entire process takes 12 seconds. And the CFR data is 100% accurate.
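The fan-out loop, sketched with asyncio; call_gemini() and update_database() are hypothetical stand-ins for the real API call and database write:

```python
# Sketch: ~40 concurrent workers draining a queue of chunked requests,
# writing each response item back to the database before taking the next chunk.
import asyncio

async def call_gemini(chunk: list[dict]) -> list[dict]:
    await asyncio.sleep(0.05)          # placeholder for the real API round-trip
    return [{"id": rec["id"], "rewritten": rec["text"].upper()} for rec in chunk]

async def update_database(item: dict) -> None:
    await asyncio.sleep(0)             # placeholder for the real DB write

async def worker(queue: asyncio.Queue) -> None:
    while True:
        chunk = await queue.get()
        try:
            for item in await call_gemini(chunk):
                await update_database(item)
        finally:
            queue.task_done()

async def main(chunks: list[list[dict]], engines: int = 40) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for chunk in chunks:
        queue.put_nowait(chunk)
    workers = [asyncio.create_task(worker(queue)) for _ in range(engines)]
    await queue.join()                 # all 200-300 chunks processed
    for w in workers:
        w.cancel()

chunks = [[{"id": i, "text": f"note {i}"}] for i in range(250)]
asyncio.run(main(chunks))
```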
It could be faster if Google didn’t return the grounding source data (RAG results) as part of the response; these are often far larger than the chunk containing the array of JSON objects. And if Google offered an instruction cache, so I didn’t have to post the instructions 300 times, it could be even FASTER. It stores my CFR data… why not my instructions and settings, like an assistant?
I would like to test Azure and see if it has faster retrieval or better features. But so far, Google is extremely fast and accurate. The objective is to use the LLM only for what it is good for: LANGUAGE. It excels at grammar, spelling, and punctuation, as well as full rewrites of descriptive text. Its context (memory), although broad, is unreliable with regard to specific things; steer clear of it. This limits the usage of the LLM, but currently, that’s all it’s useful for.
Go with a hybrid approach.
I love it; I made mine from scratch: RAG with LSH, real-time indexing, and multiple types of text-search algorithms as fallbacks.
Single and sliding-window embedding using FastBert and an ONNX model. Soon I’ll be fully implementing my custom KG.
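For anyone curious, random-hyperplane LSH for cosine similarity can be sketched in a few lines of numpy; this is the textbook version, not necessarily the implementation described above:

```python
# Sketch: random-hyperplane LSH for cosine similarity.
# Vectors hashing to the same bucket are likely to be close in angle.
import numpy as np

class CosineLSH:
    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))   # one random hyperplane per bit
        self.buckets: dict[int, list[int]] = {}

    def _hash(self, vec: np.ndarray) -> int:
        bits = (self.planes @ vec) > 0                      # which side of each hyperplane
        return int(bits.dot(1 << np.arange(bits.size)))     # pack the bit pattern into an int key

    def add(self, idx: int, vec: np.ndarray) -> None:
        self.buckets.setdefault(self._hash(vec), []).append(idx)

    def query(self, vec: np.ndarray) -> list[int]:
        # Candidates sharing the bucket; fall back to text search if this comes back empty.
        return self.buckets.get(self._hash(vec), [])

rng = np.random.default_rng(1)
vectors = rng.standard_normal((1000, 384))                  # e.g. MiniLM-sized embeddings
index = CosineLSH(dim=384)
for i, v in enumerate(vectors):
    index.add(i, v)
print(index.query(vectors[0]))                              # bucket containing vector 0
```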