How should we evaluate hallucinations in RAG systems when semantically similar context may still be irrelevant or incorrect, and the real failure may lie in retrieval or source quality, not the model itself?

In RAG, if a model “hallucinates” because the retrieved context is irrelevant or factually wrong, is that the model’s fault, the retriever’s, or the document source’s? And, more importantly, can retrieved context be semantically similar to the query yet still irrelevant for answering the specific question?

If so, how should we rethink the evaluation of hallucinations in such cases?
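To make that gap concrete: a chunk can sit very close to the question in embedding space while still lacking the fact needed to answer it. Here is a minimal toy sketch of that effect (my own illustration, with a made-up question and chunks), assuming the openai Python SDK and text-embedding-3-small:

```python
# Toy sketch: cosine similarity measures topical closeness, not answerability.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

question = "When did the Golden Gate Bridge open to vehicle traffic?"
chunks = [
    # Topically very similar to the question, but does not contain the answer.
    "The Golden Gate Bridge is a suspension bridge spanning the Golden Gate strait in California.",
    # Less lexical overlap, but actually answers the question.
    "The bridge opened to vehicles on May 28, 1937, the day after its pedestrian opening.",
]

vecs = embed([question] + chunks)
q, docs = vecs[0], vecs[1:]
sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

for sim, chunk in sorted(zip(sims, chunks), reverse=True):
    print(f"{sim:.3f}  {chunk}")
# If the non-answering chunk ranks first, the generator can only guess or refuse;
# the failure originated in retrieval, not in the model.
```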

I’m currently building a playground tool for exploring the retrieval layer in RAG pipelines, to help developers and researchers understand and analyze what can actually be retrieved with pure dense embeddings, such as those produced by embedding models like text-embedding-3-small.
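By “pure dense embeddings” I mean ranking chunks by nothing but vector similarity. A generic sketch of that step (not the playground’s actual code, and with a hypothetical corpus) looks like this:

```python
# Generic dense-retrieval sketch: rank corpus chunks against a query
# purely by embedding similarity. Assumes the openai Python SDK (v1+).
from openai import OpenAI
import numpy as np

client = OpenAI()
MODEL = "text-embedding-3-small"

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=MODEL, input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalise

# Hypothetical corpus; in practice these would be document chunks.
corpus = [
    "Invoices are processed within 30 days of receipt.",
    "Our refund policy covers purchases made in the last 14 days.",
    "The data retention period for server logs is 90 days.",
]
corpus_vecs = embed(corpus)

def retrieve(query: str, k: int = 2) -> list[tuple[float, str]]:
    q = embed([query])[0]
    scores = corpus_vecs @ q            # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(float(scores[i]), corpus[i]) for i in top]

for score, chunk in retrieve("How long do you keep server logs?"):
    print(f"{score:.3f}  {chunk}")
```

The point of the playground is to make the output of exactly this kind of ranking inspectable, so you can see whether the top-scoring chunks actually support an answer before blaming the generator.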

Let me know what you think.