I wanted to solicit ideas for a new RAG benchmark I’d like to create. OpenAI recently released SimpleQA, which shows just how horrible ungrounded LLMs are at answering fact-based questions. The leader, o1-preview, scores 42.7%. Through proper grounding you can easily get these scores up into the 90s, but I’m not aware of any great benchmarks for testing that, so I want to build one.
My question for the community is what systems and approaches should be included in the benchmark. My plan is to start with a subset of the SimpleQA dataset and then expand it to cover more complex multi-part questions over time. What systems should be in the initial test bench?
Obvious candidates would be LangGraph and LlamaIndex implementations, but which vector store should those systems use, and which RAG approaches? I’d prefer to just take off-the-shelf example projects that require little to no modification. I’m open to thoughts, suggestions, and even contributors.
If anyone has any links to semi-turnkey example RAG projects, that would be great. Long term I think the benchmark should cover a variety of different RAG techniques, but a reasonable start would be naive RAG plus whatever are considered the most capable RAG samples from LangGraph and LlamaIndex.
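For context, a naive-RAG baseline is pretty much a one-pager. Here’s a minimal sketch using LlamaIndex with its defaults (assuming a recent llama-index install and an OpenAI key in the environment; the directory name and question are placeholders):

```python
# Naive-RAG baseline sketch with LlamaIndex defaults (OpenAI embeddings + LLM).
# Assumes `pip install llama-index` and OPENAI_API_KEY set in the environment.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the shared benchmark corpus (already converted to plain text).
documents = SimpleDirectoryReader("benchmark_corpus").load_data()

# Default pipeline: chunk -> embed -> in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Retrieve the top-k chunks and let the LLM answer from them.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("Placeholder SimpleQA-style question goes here"))
```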
Full disclosure… we have our own RAG-like reasoning engine called Awareness that I’m planning to add to the benchmark. But my ultimate goal is really just to identify what’s the most capable Q&A solution you can assemble regardless of vendor.
Input documents (fairly large and multi-domain, so that we have a huge number of chunks with totally different as well as very similar meanings).
Predefined questions with both general and precise answers (dates, numbers, etc.), but without the need for complex math operations or text transformations (to exclude the LLM factor).
Answers to those questions, with the accepted variation expressed as a percentage used to validate them (for exact answers it should be very low, for general questions higher; one way to encode this is sketched below). Maybe also a control question that must result in a boolean answer, to check whether the RAG engine has the fact nailed down.
As evaluation, I would suggest starting with simple true/false plus the position of the “answer reference” chunk in the prompt sent to the answering model, graded across a range of detail or question complexity.
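To make the “accepted variation expressed as a percentage” idea concrete, here’s a rough plain-Python sketch of a scorer (all names invented): numeric answers are accepted within a per-question relative tolerance, textual answers fall back to normalized matching, and boolean control questions reduce to an exact check.

```python
import re

def _normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for loose comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def score_answer(predicted: str, expected: str, tolerance_pct: float = 0.0) -> bool:
    """True if the prediction falls within the accepted variation.

    tolerance_pct: 0.0 for exact answers (dates, IDs), higher for general ones.
    """
    # Numeric answers: accept a relative deviation up to tolerance_pct.
    try:
        pred_num, exp_num = float(predicted), float(expected)
        if exp_num == 0:
            return pred_num == 0
        return abs(pred_num - exp_num) / abs(exp_num) * 100 <= tolerance_pct
    except ValueError:
        pass
    # Boolean control questions and textual answers: normalized containment.
    return _normalize(expected) in _normalize(predicted)

# Exact number vs. a general numeric answer with 5% slack.
assert score_answer("42", "42")
assert score_answer("1050", "1000", tolerance_pct=5.0)
```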
What do you think?
And yes, we also have a “secret sauce” solution we use as RAG engine at LAWXER (https://www.lawxer.ai)
Great input… I would also say input that varies by type, so spreadsheets versus PDFs…
One of the key parts of any RAG system is the ingestion pipeline and just how good you are at document conversion. From a benchmarking perspective, though, it seems better to create a more level playing field so that all RAG systems can be evaluated equally against the exact same ingested corpus. So basically, test all systems after the documents have been converted to text.
I think systems should be free to publish a separate score showing their improvement when they ingest the raw documents themselves, but they should also be measured equally against all other systems using the exact same corpus.
Ideally this benchmark is similar to SWE-bench, where any RAG provider is free to put their system to the test and compete for a spot on the leaderboard.
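In that spirit, the harness could hide everything behind a tiny interface that every contender implements against the same pre-converted corpus. A sketch (all names hypothetical):

```python
from abc import ABC, abstractmethod

class RAGSystem(ABC):
    """Contract each benchmarked system implements against the shared corpus."""

    @abstractmethod
    def ingest(self, corpus: list[str]) -> None:
        """Index the pre-converted text documents (same corpus for everyone)."""

    @abstractmethod
    def answer(self, question: str) -> str:
        """Return a free-text answer to a single benchmark question."""

def run_benchmark(system: RAGSystem, corpus: list[str],
                  qa_pairs: list[tuple[str, str]]) -> float:
    """Ingest once, answer every question, return the fraction answered correctly."""
    system.ingest(corpus)
    correct = sum(
        expected.lower() in system.answer(question).lower()
        for question, expected in qa_pairs
    )
    return correct / len(qa_pairs)
```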
Personally I would love a benchmark that involves a model asking the right question.
I know, this sounds so edgy but I mean, come on fellas. What’s more important, knowing the answer, or asking the right question?
We can easily generate a database of all known legal information. We can desperately try to map user queries to areas of the database, but it’d be a lot easier to just have some sort of adapter.
It seems to me that with LLMs we’ll have all the already known answers in the world. We just need to find our way to them, and further.
That could be a scope issue, though. I suspect you’d first need highly personalized models that understand what a specific user is actually talking about - and that would require knowing what the user knows.
But I think it’s a good reminder of all the capabilities a RAG system is supposed to have:
1. Is the query expressing the user’s intent?
2. Is the encoded query expressing what the user expressed?
3. Is the encoded query readily mappable (or mapping) to all relevant information?
4. Is all relevant information used to construct a response?
5. Is the response answering the query?
I don’t know if I want no. 1 to be solved by an online LLM, tbh, but I guess it would be Google’s holy grail.
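One lightweight way to make those five questions measurable would be to log each pipeline stage per question and score the stages separately. A hypothetical trace record (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class RagTrace:
    """Per-question record of each pipeline stage, so each capability above
    can be scored on its own."""
    raw_query: str                     # what the user typed
    interpreted_intent: str            # 1. does it capture the user's intent?
    encoded_query: str                 # 2./3. the rewritten or embedded query
    retrieved_chunks: list[str] = field(default_factory=list)  # 3./4. evidence pulled in
    response: str = ""                 # 5. does it answer the query?
```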
Ooh, yeah. I was thinking: what if you’re just a simulation of the real deal, so that Amazon can figure out what it can pre-ship to the real Ruckus?
But yeah, “asking the real questions” should already be part of your chunking strategy for your knowledge map, no? But I guess it depends on how you do it.
I would put that more as “with LLMs you have the potential for any answer, and yes, it’s a path-finding exercise to come up with the right answer.”
With that said, for a lot of RAG tasks (especially enterprise RAG) the sequence of tokens needed to generate a correct answer is not in the model’s world knowledge, and depending on the capabilities of the underlying RAG system it may never even make it into the context window.
That’s another thing we could probably measure in a benchmark: how often does a system actually show the model the information it would need to answer the question?
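That part is fairly cheap to measure if every question is annotated with the text of its supporting evidence; a rough plain-Python sketch (the annotation format is an assumption):

```python
def evidence_hit_rate(context_windows: list[list[str]], gold_evidence: list[str]) -> float:
    """Fraction of questions whose gold evidence text actually appears in the
    context window the RAG system handed to the answering model."""
    hits = sum(
        any(gold.lower() in chunk.lower() for chunk in window)
        for window, gold in zip(context_windows, gold_evidence)
    )
    return hits / len(gold_evidence)
```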
Query rephrasing is definitely a component and I think a lot of advanced RAG systems will use it heavily.
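For reference, a bare-bones version of such a rewriter, sketched with the OpenAI Python client (the model name and prompt are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def rephrase_query(user_query: str) -> str:
    """Rewrite a user question into a self-contained, retrieval-friendly query."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question as a standalone search query. "
                        "Resolve pronouns, expand abbreviations, keep it to one sentence."},
            {"role": "user", "content": user_query},
        ],
    )
    return completion.choices[0].message.content.strip()
```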
From a pure benchmarking perspective, it’s probably going to be challenging to do much more than just say “here’s a corpus of text and here’s a bunch of questions with known answers.” At least to start…
Definitely don’t want to shut down the brainstorming… please keep the ideas coming
I’m fully convinced that in the future the tasks we pass to the model will not be “single-shot”: One question, one answer.
With agentic workflows we will have a large, complex task that the model will read and, much like being handed an idea along with unit tests, will spend a wild amount of electricity creating and refining a solution.
For those who need assistance retrieving information from a database, well, the model is there to refine their question, not to try to ignorantly find their answer.
You forget that embeddings are based on language AI - AI that already knows more than any human.
When you are working with a customer database, there are untold layers of deeper understanding of the difference between John Doe, Janet Kobayashi, and Stefani Germanotta.
If you are inquiring about the helper methods available for OpenAI Assistants, you need a model that already has the semantic clustering around a particular endpoint and what it does, not ignorance of what someone is talking about.
Therefore, in constructing the data and the use case, it very much should be real-world. You don’t want to find out how well the AI can understand the qualities of “example data” that arise within your text.
Then one must decide whether we are evaluating embeddings themselves as rankers, or a total solution that may have multiple layers of chunking and tuning, threshold learning, etc., to optimize delivery into a fixed context window.
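If the choice is to score the embeddings in isolation as rankers, the metric can be as simple as recall@k over cosine similarity. A sketch with numpy, where embed() is a toy stand-in for whatever embedding model is under test:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Toy stand-in for the embedding model under test (bag-of-characters);
    swap in real embeddings here."""
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for ch in text.lower():
            vecs[i, ord(ch) % 256] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)  # unit-normalize rows

def recall_at_k(queries: list[str], chunks: list[str], gold_idx: list[int], k: int = 5) -> float:
    """Fraction of queries whose gold chunk lands in the top-k by cosine similarity."""
    sims = embed(queries) @ embed(chunks).T      # cosine similarity on unit vectors
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([gold in row for gold, row in zip(gold_idx, topk)]))
```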
Benchmarking RAG to find out how well it can answer from your tech-support knowledge base, from scientific documents as knowledge, from your SQL logs, or as a zoo’s chatbot that never answers wrong is a “hard” task. It needs the complete application as data, and in a vast diversity of complete applications, so it isn’t simply a matter of cheating by building a RAG AI that is an expert in 10 specializations of a benchmark.
Benchmarks are far from perfect, and cheating (overfitting) is a valid concern… some types of data (spreadsheets) could be dynamically generated, which would minimize the ability to overfit the benchmark.
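A toy version of that idea: regenerate a synthetic CSV plus matching Q&A pairs on every benchmark run, so the answers are fixed by the generated data rather than by anything a model could have memorized (all field names invented):

```python
import csv
import random

def make_spreadsheet_task(path: str, n_rows: int = 200) -> list[tuple[str, str]]:
    """Write a synthetic orders spreadsheet and return (question, answer) pairs
    derived from it; fresh data every run, so there is nothing to overfit."""
    regions = ["North", "South", "East", "West"]
    rows = [
        {"order_id": i, "region": random.choice(regions), "amount": random.randint(100, 9999)}
        for i in range(n_rows)
    ]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "region", "amount"])
        writer.writeheader()
        writer.writerows(rows)

    # Simple lookup questions whose answers come only from the generated file.
    a, b = random.sample(rows, 2)
    return [
        (f"What is the amount of order {a['order_id']}?", str(a["amount"])),
        (f"Which region does order {b['order_id']} belong to?", b["region"]),
    ]
```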
Yeah, and I’m saying that this shouldn’t be a goal.
I understand that natural language is inextricably linked to socially accepted truths, but I think that this adherence to “factualness” as a safety goal is detrimental to RAG.
I disagree in the sense that I think that this could indeed be possible and quite achievable, but we need “fact-agnostic” models to that end - which currently no one is building as far as I know.
@sergeliatko as part of your benchmarking process, what open source systems do you think are really good at understanding documents with multiple columns and tables, including tables that cross page boundaries, and are also actively maintained?
This is a good idea - hope you made progress. I am planning on putting up a RAG leaderboard for RAG-as-a-Service that benchmarks all known RAGaaS offerings like OpenAI Assistants, Pinecone Assistant, CustomGPT.ai, Cohere, etc., using benchmarks like HotpotQA, SimpleQA, etc. (The idea is to make this similar to how Chatbot Arena has its leaderboard for LLMs.)
But my ultimate goal is really just to identify what’s the most capable Q&A solution you can assemble regardless of vendor.
Why not just benchmark the vendor and let the vendor deal with all the 1000 decisions needed to optimize the RAG? (this is what we do with LLMs too, no? Do we really care about all the things OpenAI does internally?)
This sounds like a great initiative to improve the evaluation of RAG systems! Starting with LangGraph and LlamaIndex is a smart choice, and using FAISS or Pinecone for vector stores will be a solid foundation. Looking forward to seeing the results!