Hallucination problem in my QnA bot built with OpenAI embeddings and the gpt-3.5-turbo chat completions API

I built a custom QnA bot on my company's HR policies. I first generated some questions from my context using the OpenAI API and then created embeddings for them. Finally, I used the gpt-3.5-turbo chat completions API to answer questions. I am facing hallucination problems: the model is not returning factually correct answers.
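Roughly, the indexing step looks like this (a minimal sketch, not my exact code; the chunking, prompt wording, and embedding model name are placeholders, and it assumes the pre-1.0 openai-python client used later in this thread):

    import openai  # pre-1.0 openai-python client

    EMBEDDING_MODEL = "text-embedding-ada-002"  # assumption: the ada embedding model

    def build_index(policy_chunks: list[str]) -> list[dict]:
        """Generate questions for each HR-policy chunk and embed them for retrieval."""
        index = []
        for chunk in policy_chunks:
            # Ask the chat model to propose questions this chunk can answer.
            questions = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{
                    "role": "user",
                    "content": f"Write 3 questions that the following HR policy text answers:\n\n{chunk}",
                }],
                temperature=0.1,
            )["choices"][0]["message"]["content"]
            # Embed the generated questions together with the chunk itself.
            embedding = openai.Embedding.create(
                model=EMBEDDING_MODEL,
                input=questions + "\n\n" + chunk,
            )["data"][0]["embedding"]
            index.append({"text": chunk, "embedding": embedding})
        return index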

1 Like

Welcome to the forum.

What settings are you using (temperature, etc.)? What does your prompt look like (system message, etc.)? How are you sending the information, and where do you query the embeddings in the process?

More info should help us get you moving in the right direction.

  1. I have set the temperature to 0.1, as I don't want the model to deviate from the facts.
  2. My prompt is built like this:

     def query_message(
         query: str,
         df: pd.DataFrame,
         model: str,
         token_budget: int,
     ):
         """Return a message for GPT, with relevant source texts pulled from a dataframe."""
         strings, relatednesses = strings_ranked_by_relatedness(query, df)
         introduction = 'Be concise with the reply and use the below section from HR policy to answer the subsequent question in a few lines. If the answer cannot be found in the section, write "I could not find an answer."'
         question = f"\n\nQuestion: {query}"
         message = introduction
         for string in strings:
             next_article = f'\n\npolicy section:\n"""\n{string}\n"""'
             if (
                 num_tokens(message + next_article + question, model=model)
                 > token_budget
             ):
                 break
             else:
                 message += next_article
         return message + question

    conversation_history = [
        {"role": "system", "content": "You are an HR assistant that answers questions about the HR policies concisely and correctly. Answer the question as truthfully as possible, and if you're unsure of the answer, say Sorry, I don't know."},
    ]
    def ask(
        query: str,
        df: pd.DataFrame = df,
        model: str = COMPLETIONS_MODEL,
        token_budget: int = 4096 - 500,
        print_message: bool = False,
    ) -> str:
        """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
        # Build the user message from the most relevant policy sections.
        message = query_message(query, df, model=model, token_budget=token_budget)
        conversation_history.append({"role": "user", "content": message})
        if print_message:
            print(conversation_history)
        response = openai.ChatCompletion.create(
            model=model,
            messages=conversation_history,
            temperature=0.1,
        )
        response_message = response["choices"][0]["message"]["content"]
        conversation_history.append(response["choices"][0]["message"])
        return response_message

  3. After the user enters a query, I generate its embedding and match it against the document embeddings using cosine similarity (scipy.spatial) to find the most related documents, roughly as sketched below.
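The strings_ranked_by_relatedness helper isn't shown above, but it could look something like this minimal version, assuming the dataframe has "text" and "embedding" columns and the same pre-1.0 openai-python client (a sketch, not necessarily the exact code):

    import openai
    import pandas as pd
    from scipy import spatial

    EMBEDDING_MODEL = "text-embedding-ada-002"  # assumption

    def strings_ranked_by_relatedness(query: str, df: pd.DataFrame, top_n: int = 5):
        """Return (strings, relatednesses) sorted from most to least related to the query."""
        query_embedding = openai.Embedding.create(
            model=EMBEDDING_MODEL,
            input=query,
        )["data"][0]["embedding"]
        # Cosine similarity = 1 - cosine distance.
        scored = [
            (row["text"], 1 - spatial.distance.cosine(query_embedding, row["embedding"]))
            for _, row in df.iterrows()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        strings, relatednesses = zip(*scored[:top_n])
        return list(strings), list(relatednesses)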

You might try putting that instruction more toward the end of the prompt and putting the text you pull from the embeddings search further up in the prompt. The location of things in the prompt matters a lot. You might also play with bringing the temperature up a bit, testing values over 0.5, up to 0.6 or 0.7 maybe.
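For example, something along these lines inside query_message (just a sketch, with illustrative wording): retrieved policy sections first, the instruction near the end, and the user's question last.

    def assemble_prompt(sections: list[str], query: str) -> str:
        """Context first, instruction near the end, question last."""
        context = "\n\n".join(f'HR policy section:\n"""\n{s}\n"""' for s in sections)
        instruction = (
            "Answer the question below in a few lines, using only the HR policy "
            "sections above. If the answer cannot be found there, write "
            '"I could not find an answer."'
        )
        return f"{context}\n\n{instruction}\n\nQuestion: {query}"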

Good luck! Let us know if you’re still having trouble…

I changed my code as suggested (below), but I am still facing the same issue. I also played with the temperature, but with no improvement.

    def query_message(
        query: str,
        df: pd.DataFrame,
        model: str,
        token_budget: int,
    ):
        """Return a message for GPT, with relevant source texts pulled from a dataframe."""
        strings, relatednesses = strings_ranked_by_relatedness(query, df)
        introduction = 'Be concise with the reply and use the above section from emumba HR policy to answer the subsequent question in a few lines. If the answer cannot be found in the section, write "I could not find an answer."'
        question = f"Question: {query}"
        message = " "
        for string in strings:
            next_article = f'\n\nEmumba policy section:\n"""\n{string}\n"""'
            if (
                num_tokens(message + next_article + question, model=model)
                > token_budget
            ):
                break
            else:
                message += next_article
        return question + message + introduction

I had similar issues when I was developing this FAQ system. Now it’s pretty well-behaved. Here are some things I did to make it more of a data science project and less like a wing and a prayer. :wink:

  • I’m using a temperature of zero with ADA for the embeddings and GPT-003 for the inferencing.
  • The embedded text is a concatenation of question + answer, with similarity matching based on user questions.
  • The process isolates the top three similarity matches and then measures their average similarity score. If the average falls below a specific threshold, I declare the question irrelevant, skip the inferencing step, and return a general message (see the sketch after this list). A high incidence of these rejections in automated and user testing indicates that your corpus is weak and needs more content.
  • I create and manage a detailed corpus in Coda and a separate table of test queries that are automated. I use this approach to put the “engineering” in prompt engineering. :wink:
  • Every test question is measured every time something is changed, and those that underperform indicate changes to the corpus are needed.
  • My tests are also versioned; we use data visuals to track progress and improvement.
  • Every question and answer is logged into an analytics tracker that tells me how long the inference took to complete. Each one is ranked for quality, and those that are poor are automatically sent to the testing framework to improve the corpus.
  • I also built testing helpers to synthesize better answers (see the Improvement App below). They use GPT itself to rewrite or clarify texts and update the corpus.
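A rough sketch of the rejection gate described in the third bullet, reusing the retrieval helper sketched earlier in this thread and the chat endpoint (the 0.80 threshold, fallback message, and system prompt are placeholders to tune against your own test queries, not the values I actually use):

    import openai

    RELEVANCE_THRESHOLD = 0.80  # placeholder; tune against your own test queries

    def answer_or_reject(query: str, df) -> str:
        """Skip the completion call when the best matches are weak on average."""
        strings, relatednesses = strings_ranked_by_relatedness(query, df, top_n=3)
        if sum(relatednesses) / len(relatednesses) < RELEVANCE_THRESHOLD:
            # Treat the question as out of scope; no inference call is made.
            return "Sorry, that doesn't seem to be covered by our FAQ."
        context = "\n\n".join(strings)
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer only from the provided FAQ excerpts."},
                {"role": "user", "content": f"{context}\n\nQuestion: {query}"},
            ],
            temperature=0,
        )
        return response["choices"][0]["message"]["content"]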

FAQ Testing Framework

FAQ Tracking Framework

Improvement App

3 Likes

This is where I determined that the most effective way to encourage the LLM to provide specific facts was to also use a learner prompt (a few-shot prompt built from example Q&As). For example, when a customer asks:

When and where do I take delivery of my CyberLandr?

The bot's logic finds the top three embeddings based on dot-product similarity. It then builds a learner prompt from those top three questions and answers. Finally, the user's question is presented with a prompt for the answer:

Answer the Question using the best response and rewrite the answer to be more natural.

Question: Do I have to take delivery of CyberLandr before my Cybertruck is available to me?
Answer: No. One of the nice things about our reservation system is that it matches the Tesla approach. You are free to complete your CyberLandr purchase when you have your Cybertruck to install it into. With all of the unknowns concerning Cybertruck's production timeline, we are making it as flexible as possible for truck buyers to marry CyberLandr with their vehicle as soon as it comes off Tesla's production line. This is why we will be manufacturing CyberLandr in Giga, Texas.

Question: When will CyberLandr become available?
Answer: We anticipate CyberLandr will be ready for delivery approximately when Tesla begins producing and shipping Cybertruck to its list of reserved purchases.

Question: When will the CyberLandr prototype be completed?
Answer: There will be many prototypes before we ship. We will share info about some of those with the CyberLandr community (reservation holders). And we will unveil the final production prototype publicly before production begins.

Question: When and where do I take delivery of my CyberLandr?
Answer: 
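To make the assembly concrete, here is a rough sketch of how a learner prompt like the one above can be built from the top three dot-product matches; the corpus format, helper names, and model call are illustrative assumptions, not the actual code behind this bot:

    import numpy as np
    import openai

    def build_learner_prompt(query: str, query_embedding, faq_rows) -> str:
        """faq_rows: dicts with 'question', 'answer', and a precomputed 'embedding'."""
        top_three = sorted(
            faq_rows,
            key=lambda row: np.dot(query_embedding, row["embedding"]),
            reverse=True,
        )[:3]
        examples = "\n\n".join(
            f"Question: {row['question']}\nAnswer: {row['answer']}" for row in top_three
        )
        return (
            "Answer the Question using the best response and rewrite the answer "
            "to be more natural.\n\n"
            f"{examples}\n\nQuestion: {query}\nAnswer:"
        )

    # The assembled prompt then goes to a completion-style model, for example:
    # openai.Completion.create(model="text-davinci-003", prompt=prompt,
    #                          temperature=0, max_tokens=300)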

UPDATE:

Without seeing some examples of the embedding corpus, it's difficult to say with precision how it should be changed, but I think the issue is fundamentally the prompt. You might try pasting an actual prompt here so we can examine it more closely; trying to guess how your code executes is difficult.

Hi Bill,
You did a great job!
I'm working on a similar FAQ chatbot project, and I would like to improve my PoC with some tracking and monitoring tools for experiments and usage. Could you tell me which framework or tool you used to get these visuals?
Thank you :slight_smile:

1 Like

I used Coda. It's ideal as a platform for building custom tools that need to blend lots of text with workflows. From text prompts to text inferences and analysis, even using Coda AI, it's a near-perfect environment for building, testing, and documenting generative AI projects.

Take a look at Promptology (a hackathon winner).

1 Like