Hallucination in retrieval augmented chatbot (RAG)

shamy · October 2, 2023, 8:22am

Our use case: Retrieval-based chatbot that answers users’ queries based on a knowledge base.

Issue: The content has generic information about a list of schools and the curriculum offered in them. When a user asked for a specific student’s curriculum it picked the first occurrence of curriculum of a random school from the content and generated the answer. It even said that the said student studies in that school. Ideally, it should not have answered it.

Observation: We provide the last 3 conversations (what the user asked and the assistant’s response based on retrieval). When chat history is removed, with the same system prompt and content it answers as expected (Sorry, I am unable to answer your query.)

Is there any relation between chat history and hallucination?
Any advice or suggestions on how to reduce or eliminate this problem of “hallucination” in the bot.

Below is the prompt that we use

"“Answer the user question based on the content provided below: Content: {Content fetched from Knowledge base} If the content doesn’t have the answer for the user asked question. Please respond with Sorry, I am unable to answer your query.”

PaulBellow · October 2, 2023, 8:34am

Welcome to the developer forum.

The issue you’re encountering—where the chatbot “hallucinates” information not present in the original dataset—is a common one in retrieval-based and generative models. This problem can be exacerbated when chat history is present because the model might mistakenly use that context to generate an answer that seems relevant but is factually incorrect.

On relation between chat history and hallucination: Context can guide a model’s response, but it can also misguide. When you include the last three conversations, the model may be influenced to produce a more “concrete” answer, thinking it has enough context to do so. However, as you’ve observed, this is not always beneficial and can lead to hallucinations.

Reducing hallucination:

a. Prompt Engineering: Your prompt is quite explicit, but you may want to make it even more stringent. You could add sentences that explicitly ask the model not to extrapolate from the data. What temperature are you using?

b. Confidence Scoring: Implement a confidence score mechanism to assess the relevance of the generated response to the query and the provided content. If the score is below a certain threshold, default to “Sorry, I am unable to answer your query.”

c. Post-processing: After the model generates an answer, you could add another layer of validation to verify the factual accuracy of the response against the data before sending it to the user.

d. User Feedback Loop: Allow users to flag incorrect answers, which could be used to fine-tune the model or adjust its confidence thresholds.

It’s a tough challenge, and you’ll likely have to combine multiple strategies to minimize this issue effectively.

N2U · October 2, 2023, 9:18am

I think this example from the cookbook might be exactly what you need

curt.kennedy · October 2, 2023, 11:51pm

Here is a notional chart when determining to use RAG, Fine-tune, or hybrid RAG+Fine-tune (source)

So, indeed, looks like a hybrid option might be the best to prevent hallucinations.

udm17 · October 3, 2023, 6:26am

That chart is good and so helpful. Really makes explaining it to other much easier. Thanks for sharing this. Cheers !

_j · October 3, 2023, 7:20am

Here is a nonsensical chart

shamy · October 3, 2023, 10:49am

Thanks Paul. Temperature is set at 0. I have also given more stringent instructions not to answer outside context but still it does that.

When I pass conversation history, I am not passing the context it used to generate the answers - only the queries and responses. I am doubting that the model looks at it and thinks it hallucinated the past answers and hence starts to hallucinate. Do you think that is a possibility?

PaulBellow · October 3, 2023, 4:42pm

It’s not that as much that the previous responses have similar tokens and begin to form a pattern that the LLM sees if that makes sense. I would work on the prompt a bit more.

chiajy2000 · October 5, 2023, 4:13pm

We’re doing this, focusing on RAG debugging at WhyHow.AI. We plug into any existing LLM/RAG system you have and focus on creating a knowledge graph based on the feedback that your tester/PM has on an output.

We’ve reduced hallucinations by 80%, debugged errors in seconds with only natural language, and reduced the time to send systems into production. Happy to see where we could be helpful!

shamy · October 6, 2023, 3:53am

Could you please suggest what we need to do with the prompt to avoid this?

chiajy2000 · October 6, 2023, 4:15am

You’d have to build something more sophisticated than a prompt unfortunately. Prompt injection can only take you so far in reducing hallucinations, and that’s something we’ve faced across multiple clients. (Also at some point the number of hallucinations you have to squash make putting more chat history in the context window unfeasible from a cost perspective + I think you’ve seen it doesn’t work too well)

I’m happy to hop on a call to better advise how to solve this - we basically started deploying a knowledge graph and LLM hybrid model. Jerry from Llamaindex has been aggressively promoting this method recently on twitter.

ADefWebserver · October 6, 2023, 12:31pm

What do you mean by this? I looked up Jerry’s posts on the web and found this. Is this what you mean?

chiajy2000 · October 6, 2023, 2:27pm

Hey!

Not that one - that’s describing a more structured process for vector retrieval, which wouldn’t be a good process in your specific example.

Here’s a great list that Jerry came up with but it’s incomplete. We’ve built all of these for various purposes, but found the Custom Combo Query Engine to be super powerful and general (i.e. you can use logical deduction & prevent hallucinations & do multi-hop rag with the same engine).

For Shamy’s problem though, I think you wouldn’t need to go so far. Simply insert a knowledge graph as a post-processing step, insert the rule into the KG with natural language (so it takes literal seconds to set up) ‘X student curriculum is not Y’, and the LLM should output ‘I cannot answer your query’. It’s an RLHF model for debugging that works remarkably well, and isn’t in the list above because you’re not using the KG for full answer retrieval, you’re using the KG for full non-answer retrieval, which is an interesting nuance that reduces the amount of tech you need to set up.

That would be the best fix in my opinion.

ADefWebserver · October 6, 2023, 3:24pm

What is your Knowledge Graph? Is it a vector database?

chiajy2000 · October 6, 2023, 4:07pm

Nope, we use a vector database as well as a Neo4j database in parallel. The company information that’s used is stored in the vector database. The rules we want enforced are stored in the Neo4j database. We use a system somewhat similar to ’The GenAI stack’ that Justin Cormack announced in DockerCon that links LangChain and Neo4j together. (I’d include a link but not allowed to post links here)

jochenschultz · October 6, 2023, 7:36pm

You also got a decission layer which queries this only when it can expect the right data?

chiajy2000 · October 6, 2023, 11:28pm

Sorry to be pedantic but what does ‘this’ refer to? I assume you mean ‘this’ = ‘knowledge graph’ vs a vector database query engine?

The short answer is yes, but the exact structure should depend on what you’re looking to do - you can certainly set it up where it only queries the knowledge graph for problematic questions.

jochenschultz · October 6, 2023, 11:48pm

Nah just wanted to ask if you have that already. Thought you might need some help.

Nice visualisation though.

SomebodySysop · October 8, 2023, 10:08am

I maintain the chat history, but I don’t send it to the model (with the question) – instead, I use it to create a standalone question which I send to the model along with the context documents. I’ve not had any problems (ever) with hallucination in my RAG implementation. When the answer isn’t in the retrieved documents, the model (whether gpt-3.5, gpt-3.5-turbo or gpt-4) always responds as it it is instructed by the system prompt (which is also always sent with the standalone question).

Kobiel · October 11, 2023, 7:11pm

Have you used any tool in your RAG implementation? I noticed, a chat model starts to hallucinate more often when it has more ‘power’. I also didn’t noticed any hallucinations with in a simple RAG app that only retrieve data from my documents, but when I added 1 more tool, the model sometimes ignore instructions and does whatever it wants

Topic		Replies	Views
How to pass conversation history back to the API API chat-completion	14	38757	April 1, 2024
Getting ChatGPT to Remember Previous Chat Messages Prompting	37	69169	January 29, 2024
Context aware context for follow-up question Prompting embeddings , gpt-4 , api	13	8970	October 16, 2024
How do you maintain historical context in repeat API calls? API	29	90920	December 23, 2023
Do 'MAX tokens' include the follow up prompts and completion in a single chat session API token	22	5349	August 25, 2023

Hallucination in retrieval augmented chatbot (RAG)

Related topics