How should we evaluate hallucinations in RAG systems when semantically similar context may still be irrelevant or incorrect, and the real failure may lie in retrieval or source quality, not the model itself?

In RAG, if a model “hallucinates” because the retrieved context is irrelevant or factually wrong, is that the model’s fault, the retriever’s, or the document source’s? And more importantly, can retrieved context be semantically similar but still irrelevant for answering the specific question?

If so, how should we rethink the evaluation of hallucinations in such cases?

I’m currently building a playground tool for exploring the retrieval layer in RAG pipelines, to help developers and researchers understand and analyze what can actually be retrieved with pure dense embeddings (e.g., text-embedding-3-small).
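To make the “semantically similar but still irrelevant” case concrete, here’s a minimal sketch of roughly the kind of comparison such a tool could surface: ranking a few candidate chunks against a question by cosine similarity over text-embedding-3-small vectors. It assumes the OpenAI Python SDK (v1) and numpy; the question and chunk texts are made-up placeholders.

```python
# Rank candidate chunks against a question by cosine similarity using
# text-embedding-3-small. Assumes the OpenAI Python SDK v1 and numpy;
# the question and chunks below are illustrative placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

question = "What was the company's revenue in Q3 2023?"
chunks = [
    "The company reported strong revenue growth across all segments in 2022.",
    "Q3 2023 revenue came in at $4.2B, up 8% year over year.",
    "Revenue recognition policies are described in note 2 of the filing.",
]

q_vec = embed([question])[0]
c_vecs = embed(chunks)

# Cosine similarity: dot product divided by the product of the vector norms.
sims = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))

# All three chunks tend to score highly (they are all "about revenue"),
# even though only one of them can actually answer the question.
for sim, chunk in sorted(zip(sims, chunks), reverse=True):
    print(f"{sim:.3f}  {chunk}")
```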

Let me know what you think.


Hallucinations in RAG setups can come from several interacting factors, and not all of them are the model’s fault. Even if the retrieved context is semantically similar to the query, it can still be irrelevant or misleading for answering the specific question. In that case, the “root cause” may lie in the retrieval step, the quality of the source material, or even in how the task itself is formulated.

Some common contributors include:

Retrieval bringing in irrelevant or noisy chunks, because semantic similarity alone doesn’t guarantee usefulness for the exact question. (Are you matching on the question itself, or first working out what the answer would look like before running the vector search?)

Overloading the prompt with too many chunks, which dilutes the relevant context and increases the chance the model fills gaps with guesswork.

Vague or overly complex task instructions that make it harder for the model to focus.

An LLM not adapted to the domain or lacking a way to decline when the answer can’t be confidently produced.

To reduce these issues, it helps to treat hallucination control as a workflow design problem:

1. Start by clearly defining the expected output and working backward to identify exactly what information is required to produce it.

2. Build retrieval filters, possibly using an LLM or other heuristics, to validate that each chunk is relevant and useful before adding it to the prompt (a minimal sketch follows this list).

3. Keep prompts concise and focused, avoiding unnecessary noise.

4. Run post-processing to check the answer against the provided context, flag or remove contradictions, and optionally verify facts against other trusted sources (see the second sketch below).
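For step 2, here is a minimal sketch of an LLM-as-judge relevance filter, assuming the OpenAI Python SDK; the model name (gpt-4o-mini) and the prompt wording are illustrative choices, not a prescribed setup.

```python
# Step 2 sketch: an LLM-as-judge relevance filter that decides, per chunk,
# whether the chunk actually helps answer the question before it is added
# to the prompt. Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def is_relevant(question: str, chunk: str, model: str = "gpt-4o-mini") -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer strictly 'yes' or 'no'. Say 'yes' only if the "
                        "passage contains information needed to answer the question."},
            {"role": "user",
             "content": f"Question: {question}\n\nPassage: {chunk}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def filter_chunks(question: str, chunks: list[str]) -> list[str]:
    # Keep only chunks the judge marks as useful; everything else is dropped
    # before prompt assembly so it cannot dilute the context.
    return [c for c in chunks if is_relevant(question, c)]
```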
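And for step 4, a similar sketch of a post-generation groundedness check that asks a judge model whether a draft answer is actually supported by the retrieved context; again, the model name and verdict labels are assumptions for illustration.

```python
# Step 4 sketch: a groundedness check run after generation. It asks a judge
# model whether the draft answer is backed by the retrieved context, so the
# pipeline can flag or regenerate anything not marked 'supported'.
from openai import OpenAI

client = OpenAI()

def grounding_verdict(context: str, answer: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word: 'supported' if every claim "
                        "in the answer is backed by the context, 'contradicted' if "
                        "any claim conflicts with it, or 'unsupported' otherwise."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

# Anything not 'supported' can be surfaced to the user, regenerated, or
# cross-checked against other trusted sources.
```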

By looking at hallucinations this way, you can separate model behavior from retrieval and source quality issues, and you can design your pipeline so that each stage — retrieval, filtering, prompting, and verification — works together to keep the model’s output grounded.