Hallucination in a retrieval-augmented chatbot (RAG)

Our use case: Retrieval-based chatbot that answers users’ queries based on a knowledge base.

Issue: The content contains generic information about a list of schools and the curricula they offer. When a user asked for a specific student’s curriculum, the bot picked the curriculum of an arbitrary school from the content and generated an answer from it, even claiming that the student studies at that school. Ideally, it should not have answered at all.

Observation: We provide the last 3 conversations (the user’s questions and the assistant’s retrieval-based responses) as chat history. When the chat history is removed, with the same system prompt and content, the bot answers as expected (“Sorry, I am unable to answer your query.”).

Is there any relation between chat history and hallucination?
Any advice or suggestions on how to reduce or eliminate this “hallucination” problem in the bot would be appreciated.

Below is the prompt that we use:

“Answer the user question based on the content provided below: Content: {Content fetched from Knowledge base} If the content doesn’t have the answer for the user asked question. Please respond with Sorry, I am unable to answer your query.”

Welcome to the developer forum.

The issue you’re encountering, where the chatbot “hallucinates” information not present in the retrieved content, is a common one in retrieval-based and generative systems. It can be exacerbated by chat history, because the model may mistakenly use that context to generate an answer that seems relevant but is factually incorrect.

On the relation between chat history and hallucination: context can guide a model’s response, but it can also misguide it. When you include the last three conversations, the model may be influenced to produce a more “concrete” answer, assuming it has enough context to do so. However, as you’ve observed, this is not always beneficial and can lead to hallucinations.

Reducing hallucination:

a. Prompt Engineering: Your prompt is quite explicit, but you may want to make it even more stringent. You could add sentences that explicitly ask the model not to extrapolate from the data. What temperature are you using?
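Point (a) in concrete form. This is only a sketch: the exact wording and the `###` delimiters are illustrative choices, not a tested recipe, but the pattern (explicit source restriction, explicit refusal string, an instruction not to treat chat history as a fact source) is what tends to help.

```python
# A stricter version of the RAG prompt (illustrative wording).
REFUSAL = "Sorry, I am unable to answer your query."

TEMPLATE = (
    "Answer the user's question using ONLY the content between the ### markers.\n"
    "Do not treat earlier conversation turns as a source of facts.\n"
    "Do not guess, extrapolate, or combine unrelated facts from the content.\n"
    "If the content does not explicitly answer the question, respond with exactly:\n"
    "{refusal}\n\n"
    "### Content ###\n"
    "{content}\n"
    "### End content ###\n\n"
    "Question: {question}"
)

def build_prompt(content: str, question: str) -> str:
    # Fill the template with the retrieved content and the user's question.
    return TEMPLATE.format(refusal=REFUSAL, content=content, question=question)
```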

b. Confidence Scoring: Implement a confidence score mechanism to assess the relevance of the generated response to the query and the provided content. If the score is below a certain threshold, default to “Sorry, I am unable to answer your query.”
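Point (b) could be sketched like this. A real system would score relevance with embedding cosine similarity or an LLM judge; to keep the example self-contained, this uses a toy bag-of-words cosine, so the threshold value is purely illustrative.

```python
import math
from collections import Counter

REFUSAL = "Sorry, I am unable to answer your query."

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gate(answer: str, context: str, threshold: float = 0.2) -> str:
    # Refuse when the generated answer shares too little vocabulary
    # with the retrieved content.
    score = cosine(Counter(answer.lower().split()),
                   Counter(context.lower().split()))
    return answer if score >= threshold else REFUSAL
```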

c. Post-processing: After the model generates an answer, you could add another layer of validation to verify the factual accuracy of the response against the data before sending it to the user.
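One cheap post-processing check in the spirit of (c), and a sketch rather than a full fact-checker: flag capitalized words in the answer that never appear in the retrieved content, as a rough proxy for “the model introduced a school or student name on its own”. Sentence-initial words will occasionally be flagged too, so treat hits as a signal to re-verify (or refuse), not as proof of error.

```python
import re

def unsupported_names(answer: str, context: str) -> list:
    # Capitalized tokens in the answer that are absent from the context.
    context_lower = context.lower()
    candidates = re.findall(r"\b[A-Z][a-z]+\b", answer)
    return sorted({w for w in candidates if w.lower() not in context_lower})
```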

d. User Feedback Loop: Allow users to flag incorrect answers, which could be used to fine-tune the model or adjust its confidence thresholds.

It’s a tough challenge, and you’ll likely have to combine multiple strategies to minimize this issue effectively.


I think this example from the cookbook might be exactly what you need :wink:


Here is a notional chart for deciding when to use RAG, fine-tuning, or a hybrid RAG + fine-tuning approach (source)

So, indeed, it looks like a hybrid option might be best for preventing hallucinations.


That chart is good and so helpful. It really makes explaining this to others much easier. Thanks for sharing. Cheers!


Here is a nonsensical chart


Thanks Paul. Temperature is set to 0. I have also given more stringent instructions not to answer outside the context, but it still does.

When I pass the conversation history, I don’t include the context that was used to generate the earlier answers, only the queries and responses. I suspect the model looks at the history, sees that it (apparently) hallucinated the past answers, and hence starts to hallucinate again. Do you think that is a possibility?

It’s not that so much as that the previous responses contain similar tokens and begin to form a pattern that the LLM picks up on, if that makes sense. I would work on the prompt a bit more.

We’re working on exactly this, focusing on RAG debugging, at WhyHow.AI. We plug into any existing LLM/RAG system you have and build a knowledge graph from the feedback your tester/PM gives on an output.

We’ve reduced hallucinations by 80%, debugged errors in seconds using only natural language, and cut the time it takes to ship systems to production. Happy to see where we could be helpful!


Could you please suggest what we need to do with the prompt to avoid this?

You’d have to build something more sophisticated than a prompt, unfortunately. Injecting instructions into the prompt can only take you so far in reducing hallucinations, and that’s something we’ve seen across multiple clients. (Also, at some point the number of hallucinations you have to squash makes stuffing more chat history into the context window unfeasible from a cost perspective, and I think you’ve seen it doesn’t work too well anyway.)

I’m happy to hop on a call to advise on how to solve this. We basically started deploying a hybrid knowledge-graph + LLM model; Jerry from LlamaIndex has been aggressively promoting this method on Twitter recently.


What do you mean by this? I looked up Jerry’s posts on the web and found this. Is this what you mean?



Not that one. That one describes a more structured process for vector retrieval, which wouldn’t be a good fit for your specific example.

Here’s a great list that Jerry came up with, though it’s incomplete. We’ve built all of these for various purposes, but found the Custom Combo Query Engine to be especially powerful and general (i.e. you can do logical deduction, prevent hallucinations, and do multi-hop RAG with the same engine).

For Shamy’s problem, though, I think you wouldn’t need to go that far. Simply insert a knowledge graph as a post-processing step and add the rule to the KG in natural language (so it takes literal seconds to set up), e.g. ‘Student X’s curriculum is not Y’, and the LLM should output ‘I cannot answer your query.’ It’s an RLHF-style model for debugging that works remarkably well. It isn’t in the list above because you’re not using the KG for answer retrieval; you’re using it for non-answer retrieval, an interesting nuance that reduces the amount of tech you need to set up.

That would be the best fix in my opinion.
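A toy version of that “non-answer” rule layer, to show the shape of the idea. A real deployment would store the rules in a graph database (e.g. Neo4j) and match them semantically; this sketch uses plain substring matching, and the (subject, forbidden-claim) rule format is an assumption for illustration.

```python
REFUSAL = "Sorry, I am unable to answer your query."

def apply_rules(answer: str, rules: list) -> str:
    # Each rule is a (subject, forbidden) pair meaning
    # "an answer about `subject` must not assert `forbidden`".
    # If a generated answer violates a rule, return a refusal instead.
    lowered = answer.lower()
    for subject, forbidden in rules:
        if subject.lower() in lowered and forbidden.lower() in lowered:
            return REFUSAL
    return answer
```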


What is your Knowledge Graph? Is it a vector database?


Nope, we use a vector database and a Neo4j database in parallel. The company information is stored in the vector database; the rules we want enforced are stored in Neo4j. We use a system somewhat similar to ‘The GenAI Stack’ that Justin Cormack announced at DockerCon, which links LangChain and Neo4j together. (I’d include a link, but I’m not allowed to post links here.)


Do you also have a decision layer that queries this only when it can expect the right data?

Sorry to be pedantic, but what does ‘this’ refer to? I assume ‘this’ means the knowledge graph, as opposed to a vector-database query engine?

The short answer is yes, but the exact structure should depend on what you’re looking to do. You can certainly set it up so that it only queries the knowledge graph for problematic questions.

Nah, just wanted to ask whether you have that already. Thought you might need some help.

Nice visualisation though.

I maintain the chat history, but I don’t send it to the model with the question. Instead, I use it to create a standalone question, which I send to the model along with the context documents. I’ve never had any problems with hallucination in my RAG implementation: when the answer isn’t in the retrieved documents, the model (whether gpt-3.5, gpt-3.5-turbo, or gpt-4) always responds as it is instructed to by the system prompt (which is also always sent with the standalone question).
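The standalone-question step described above can be sketched as follows. Here `rewrite` stands in for whatever chat-completion call you use; it is stubbed as a plain callable so the example is self-contained, and the rewriting prompt wording is an assumption.

```python
def make_standalone(history, question, rewrite):
    # Collapse the chat history plus the new question into one
    # self-contained question. `rewrite` is any callable that takes a
    # prompt string and returns text; in production it would wrap an
    # LLM call.
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = (
        "Rewrite the final user question so it is fully self-contained, "
        "resolving pronouns and references using the conversation below.\n\n"
        f"{transcript}\n\n"
        f"Final question: {question}\n\n"
        "Standalone question:"
    )
    return rewrite(prompt).strip()
```

The standalone question (not the raw history) then goes to the model together with the retrieved documents and the system prompt, so earlier answers never re-enter the context as a potential fact source.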


Have you used any tools in your RAG implementation? I’ve noticed that a chat model starts to hallucinate more often when it has more ‘power’. I also didn’t notice any hallucinations in a simple RAG app that only retrieves data from my documents, but when I added one more tool, the model sometimes ignores instructions and does whatever it wants.