How should we evaluate hallucinations in RAG systems when semantically similar context may still be irrelevant or incorrect, and the real failure may lie in retrieval or source quality, not the model itself?

In RAG, if a model “hallucinates” because the retrieved context is irrelevant or factually wrong, is that the model’s fault, the retriever’s, or the document source’s? And more importantly, can retrieved context be semantically similar but still irrelevant for answering the specific question?

If so, how should we rethink the evaluation of hallucinations in such cases?

I’m currently building a playground tool for exploring the retrieval layer in RAG pipelines, to help developers and researchers understand and analyze what can actually be retrieved with pure dense embeddings (e.g., text-embedding-3-small).
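To make the “semantically similar but still irrelevant” case concrete, here’s a minimal sketch of roughly the kind of comparison such a tool could surface: ranking a few candidate chunks against a question by cosine similarity over text-embedding-3-small vectors. It assumes the OpenAI Python SDK (v1) and numpy; the question and chunk texts are made-up placeholders.

```python
# Rank candidate chunks against a question by cosine similarity using
# text-embedding-3-small. Assumes the OpenAI Python SDK v1 and numpy;
# the question and chunks below are illustrative placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

question = "What was the company's revenue in Q3 2023?"
chunks = [
    "The company reported strong revenue growth across all segments in 2022.",
    "Q3 2023 revenue came in at $4.2B, up 8% year over year.",
    "Revenue recognition policies are described in note 2 of the filing.",
]

q_vec = embed([question])[0]
c_vecs = embed(chunks)

# Cosine similarity: dot product divided by the product of the vector norms.
sims = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))

# All three chunks tend to score highly (they are all "about revenue"),
# even though only one of them can actually answer the question.
for sim, chunk in sorted(zip(sims, chunks), reverse=True):
    print(f"{sim:.3f}  {chunk}")
```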

Let me know what you think.


Hallucinations in RAG setups can come from several interacting factors, and not all of them are the model’s fault. Even if the retrieved context is semantically similar to the query, it can still be irrelevant or misleading for answering the specific question. In that case, the “root cause” may lie in the retrieval step, the quality of the source material, or even in how the task itself is formulated.

Some common contributors include:

Retrieval bringing in irrelevant or noisy chunks, because semantic similarity alone doesn’t guarantee usefulness for the exact question. (Are you matching on the question itself, or first working out what the answer would look like before running the vector search?)

Overloading the prompt with too many chunks, which dilutes the relevant context and increases the chance the model fills gaps with guesswork.

Vague or overly complex task instructions that make it harder for the model to focus.

An LLM not adapted to the domain or lacking a way to decline when the answer can’t be confidently produced.

To reduce these issues, it helps to treat hallucination control as a workflow design problem:

1. Start by clearly defining the expected output and working backward to identify exactly what information is required to produce it.

2. Build retrieval filters, possibly using an LLM or other heuristics, to validate that each chunk is relevant and useful before adding it to the prompt (a minimal sketch follows this list).

3. Keep prompts concise and focused, avoiding unnecessary noise.

4. Run post-processing to check the answer against the provided context, flag or remove contradictions, and optionally verify facts against other trusted sources (see the second sketch below).
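For step 2, here is a minimal sketch of an LLM-as-judge relevance filter, assuming the OpenAI Python SDK; the model name (gpt-4o-mini) and the prompt wording are illustrative choices, not a prescribed setup.

```python
# Step 2 sketch: an LLM-as-judge relevance filter that decides, per chunk,
# whether the chunk actually helps answer the question before it is added
# to the prompt. Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def is_relevant(question: str, chunk: str, model: str = "gpt-4o-mini") -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer strictly 'yes' or 'no'. Say 'yes' only if the "
                        "passage contains information needed to answer the question."},
            {"role": "user",
             "content": f"Question: {question}\n\nPassage: {chunk}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def filter_chunks(question: str, chunks: list[str]) -> list[str]:
    # Keep only chunks the judge marks as useful; everything else is dropped
    # before prompt assembly so it cannot dilute the context.
    return [c for c in chunks if is_relevant(question, c)]
```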
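And for step 4, a similar sketch of a post-generation groundedness check that asks a judge model whether a draft answer is actually supported by the retrieved context; again, the model name and verdict labels are assumptions for illustration.

```python
# Step 4 sketch: a groundedness check run after generation. It asks a judge
# model whether the draft answer is backed by the retrieved context, so the
# pipeline can flag or regenerate anything not marked 'supported'.
from openai import OpenAI

client = OpenAI()

def grounding_verdict(context: str, answer: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word: 'supported' if every claim "
                        "in the answer is backed by the context, 'contradicted' if "
                        "any claim conflicts with it, or 'unsupported' otherwise."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

# Anything not 'supported' can be surfaced to the user, regenerated, or
# cross-checked against other trusted sources.
```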

By looking at hallucinations this way, you can separate model behavior from retrieval and source quality issues, and you can design your pipeline so that each stage — retrieval, filtering, prompting, and verification — works together to keep the model’s output grounded.