Fine-tuned QA chatbot answers with nonfactual text instead of the training data

I have been testing whether a fine-tuned QA bot can answer questions about companies' information. I prepared approximately 1000 prompt/completion pairs as training data. When using the fine-tuned model, however, it produces nonfactual answers most of the time, even when the prompt is exactly the same as one in the training data. I have tried suggestions from the web and this forum, but so far no luck. If you have suggestions beyond what I have already tried, I would really appreciate your help.

Goal: The bot should answer from the training data when the prompt is similar to a trained one.
Problem: The bot does not answer from the training data even when the prompt is exactly the same as a trained one; it returns nonfactual answers instead.

What I have tried :

  • Base models curie and davinci
    I built a fine-tuned model on both curie and davinci for each case below. Neither helped when prompting with the training data.
  • Changed completion parameters
    A lower temperature seems to work better.
response = openai.Completion.create(
  model=FINE_TUNED_MODEL,   # the fine-tuned model name
  prompt=BOT_PREFIX + aprompt,
  temperature=0.0,  # tried values from 0 to 1
  top_p=1,          # tried 0, 1, or commented out
  # best_of=1,      # tried 1 or left commented out
  stop=["###", "->", "\n"],
)
  • Changed the prompt prefix
    I prepended one of the following prefix sentences to the prompt. I still have not found the best one.

    BOT_PREFIX= "The following is a conversation with an AI assistant called BOT and a user. BOT is empathic and friendly. BOT's objective is to help the user find StartUp companies. With each response, BOT offers follow-up questions to encourage openness and tries to continue the conversation in a natural way. ### "

    BOT_PREFIX = 'The following is a conversation with an AI assistant called BOT. BOT is helpful, creative, clever, and very friendly. If you ask BOT a question that is rooted in truth, BOT will give you the answer. If you ask BOT a question that is nonsense, trickery, or has no clear answer, I will respond with "Sorry, I am not sure. I will learn more to support you.". ### '

  • Changed the format of the training data (1000+ records). I shortened the descriptions from long ones to summarized short ones. Shorter, simpler data seems to respond better, but even the short ones mostly do not answer as expected.

# CASE I : prompt ends with "->", completion ends with "\n"
{"prompt":"Tell me about ABC ->","completion":" ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles.\n"}
{"prompt":"Tell me about BCD ->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture.\n"}
# CASE II: prompt ends with "###" and starts with "User: ", completion ends with "###"
{"prompt":"User: Tell me about ABC ###","completion":" ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles. ###"}
{"prompt":"User: Tell me about BCD ###","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture. ###"}

Output Example

User: Tell me about BCD
BOT:  BCD is a Cynefin-based AI company that helps companies make sense of their data.
# I expected something similar to the trained completion, but the answer is nonfactual

My thoughts :

  • The training data format may affect response quality, so I tested the format above and a few others from websites and QA forums, but so far no big gain.
  • Prefix sentences in prompts also affect response quality, but so far I have not come up with better ones than those above.
  • What else can I try…

I'm facing a similar issue and am going through a new round of tests today.
I'll share my results to help us solve this shared problem.

To me it looked like my prompts closely resembled the vast collection of prompts in GPT-3's training corpus, so the engine responded with whatever was more dominant in the base model.

When I tried to make the prompts unique, the problem seemed to be the low number of prompts in my dataset: when GPT-3 had to choose what to respond, it favored the dominant patterns over the closest match.

What I’m trying today is to generate a larger dataset of prompts with very specific context, which are very unlikely to be found in GPT-3’s model.

If I could give feedback to OpenAI, I would suggest adding an option to set the priority of my dataset in the fine-tuned model. (cc: @pw)


Hi guys, I might be wrong, but in my opinion fine-tuning is meant to improve the quality of the model's output, not to train it to answer based on factual data.

What would work is:

  • Create a document with facts.
  • Embed the facts to get vectors (store the vectors on your end).
  • When the bot receives a question about facts, embed the question to get its vector, then use cosine similarity to find the facts whose vectors are closest, and supply those facts as context in the prompt so the bot can use them to answer the question.

There is a guide on using embeddings for question answering in the docs.
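The steps above can be sketched in code. The cosine ranking below is the real mechanic; the `toy_embed` function is only a stand-in for an actual embeddings API call, and the fact list is invented for the example.

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding call (e.g. an OpenAI embeddings
    # endpoint); a bag-of-words vector is enough to show the mechanics.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Facts embedded ahead of time; the vectors are stored on your end.
facts = [
    "BCD is a breach and attack simulation platform.",
    "ABC is a blockchain technology company that develops NFTs.",
]
fact_vectors = [(fact, toy_embed(fact)) for fact in facts]

def build_context(question, top_n=1):
    # Embed the question, rank facts by cosine similarity, and return
    # the closest ones to be pasted into the prompt as context.
    qv = toy_embed(question)
    ranked = sorted(fact_vectors,
                    key=lambda fv: cosine_similarity(qv, fv[1]),
                    reverse=True)
    return "\n".join(fact for fact, _ in ranked[:top_n])

context = build_context("Tell me about BCD")
prompt = f"Answer using only these facts:\n{context}\n\nQuestion: Tell me about BCD\nAnswer:"
```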


Thanks @sergeliatko, I’m still going through the documentation.
There are use cases where facts have to be mixed with the model's knowledge base.

I don’t have this particular use case, but similar cases like the one below could occur.

Let’s say we put this piece of information in an embedding:

John Doe couldn’t sleep last night because of the party upstairs.


Then we ask: What may cause John Doe to forget his wallet at home this morning?

The output should be:

John Doe couldn’t sleep last night because of the party upstairs.

But in order to determine that poor sleep is to blame, we first need to decide whether the prompt should consult the embeddings at all.
In this case it should first look up what causes people to forget, and that information is not in the embeddings.
Once the reasons for forgetting are retrieved, it should compare them with the embeddings.

Would that work?

1 Like

@georgei, @sergeliatko, thank you for the advice!

Using embeddings to measure cosine similarity seems solid.
As a brief test, I used the "text-similarity-babbage-001" model (2048 dimensions); the cosine comparisons are below.

Trained Prompt: Tell me about BCD

If the user queries the following similar prompts, the cosine similarity with the trained prompt looks like this:

Prompt : What is BCD
Cosine Similarity : 0.930927151

Prompt : Please describe about BCD
Cosine Similarity : 0.947527438

Prompt : what is BCD like?
Cosine Similarity : 0.893752611

If the user queries an unrelated prompt, the similarity value is lower.

Prompt : Tell me about XYZ
Cosine Similarity : 0.789166749

Still a very small sample, but with this approach, if I set the cosine similarity threshold around 0.85, I know what to respond from the training data and I don't even need to call the API… It sounds more like a rule-based chatbot, with AI as a backup for prompts not covered by the training data. It could be a good fallback if none of my other attempts work.
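The threshold idea can be sketched as a small routing function. The similarity scores would come from embedding comparisons like the babbage ones above; here they are passed in directly, and the stored answer is abbreviated.

```python
# Answer from stored training data when the query is close enough to a
# known prompt; otherwise signal the caller to fall back to the API.
SIMILARITY_THRESHOLD = 0.85

trained = {
    "Tell me about BCD": "BCD belongs to CyberSecurity. HQ is in ISR. ...",
}

def route(query, scores):
    # scores maps each trained prompt to its cosine similarity with `query`.
    best_prompt, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score >= SIMILARITY_THRESHOLD:
        return trained[best_prompt]  # answer directly, no API call needed
    return None                      # caller falls back to the model
```

With the measured scores above, "What is BCD" at 0.9309 returns the stored completion, while "Tell me about XYZ" at 0.7892 falls through to the model.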

I'm still trying to figure out how to handle trained facts with a single model. If I make any progress, I will let you know. And I appreciate any updates or ideas.

1 Like

I got a pretty good result with this approach.

Here is what I did.
The dataset had about 20 prompt variations for the same completion.
The more prompt variations, the higher the chance of obtaining the expected completion.

The most specific prompt in my case was like this:
{ "prompt": "Given the site ID 'PB1-U2', the topic 'X', the content type 'Y', the content category 'Z' and the date Wed Oct 19 2022, answer the question: What caused the damage to the Nord Stream gas pipeline from Russia to Europe?\n", "completion": "Danish police say powerful explosions caused the damage.\n" }

And the least specific prompt was like this:
{ "prompt": "What caused the damage to the Nord Stream gas pipeline from Russia to Europe?\n", "completion": "Danish police say powerful explosions caused the damage.\n" }

On the playground I was able to ask all sorts of questions including this one:

What has caused the damage to the Nord Stream LNG gas pipeline?

And the response was:

Danish police say powerful explosions caused the damage.

There is still a lot of work to do to improve the fine-tune, but I'm optimistic.
So far I have used curie, not davinci.
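Producing many prompt variations per completion is easy to script. A sketch, assuming a few hypothetical phrasing templates (a real dataset would want far more, including context-rich variants like the site-ID one above):

```python
import json

# Hypothetical templates; each variation keeps the conventions from the
# examples above, with both prompt and completion ending in "\n".
TEMPLATES = [
    "{q}",
    "Please answer the question: {q}",
    "Given the topic 'news' and today's date, answer the question: {q}",
]

def variations(question, completion):
    for template in TEMPLATES:
        yield {"prompt": template.format(q=question) + "\n",
               "completion": completion + "\n"}

records = list(variations(
    "What caused the damage to the Nord Stream gas pipeline from Russia to Europe?",
    "Danish police say powerful explosions caused the damage.",
))
jsonl = "\n".join(json.dumps(r) for r in records)  # one record per line
```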

@georgei Thank you for sharing your update. Wow, it is very interesting!

I tried your prefix sentence and other similar ones to see if I could get better results. The responses are slightly better, but in my case still far from satisfactory…

The reason might be the lack of prompt variations, as you pointed out. I have 1000 training samples, but each one is independent. So my next step is to add 3 variations for each one and update the fine-tuned model to see if the responses improve. Once I have results, I will post an update.

1 Like

Please read the Text search embeddings section of the OpenAI API docs again.

Get your facts into a file. Break the file into paragraphs and get a text-search doc embedding for each paragraph. Store the results (text->vector) on your end.

When the bot is asked a question, embed the query with the text-search query embedding, then run cosine similarity against the saved vectors to retrieve the highest-ranked texts.

Then create a prompt that looks like:

Bot description prompt.
Previous conversation summary: summary.
List of context documents:
Doc 1 text

Doc 5 text

User question: user question text?
Bot answer:

Then run that prompt against text-davinci-002 to get the bot's answer. I bet the result will be good, and there is probably no need for fine-tuning.
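Assembling that prompt layout is straightforward; a sketch, with the bot description, summary, and retrieved documents as made-up placeholders:

```python
def build_prompt(bot_description, summary, context_docs, user_question):
    # Mirrors the layout above: description, conversation summary,
    # retrieved context documents, then the question to answer.
    docs = "\n\n".join(context_docs)
    return (f"{bot_description}\n"
            f"Previous conversation summary: {summary}\n"
            f"List of context documents:\n{docs}\n\n"
            f"User question: {user_question}\n"
            f"Bot answer:")

prompt = build_prompt(
    "BOT is a helpful assistant that answers from the documents provided.",
    "The user is researching startup companies.",
    ["BCD is a breach and attack simulation platform.",
     "ABC is a blockchain company that develops NFTs."],
    "Tell me about BCD",
)
# `prompt` would then be sent to text-davinci-002 via the completions API.
```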

1 Like

Just to make sure you guys understand: the purpose of fine-tuning is to train the model on how to complete the prompt, or "how to answer a question" in your case. It is all about the manner.

It has nothing to do with "what to include in the reply".

Your particular cases are about "what to answer".

1 Like

@sergeliatko is exactly right. You cannot use fine-tuning to teach GPT-3 your facts or to restrict GPT-3's answers to your dataset.

I think that @cavalierski's case is closer to the embedding solution.

I'll need to combine actual facts with the knowledge base of a model.
An example is returning a company's stock value from my dataset along with sentiment analysis of the text input, where the sentiment information is not provided by my dataset.

In this case “teach your bot to run sentiment analysis”… Kidding.

You need a filter between the user and the bot. If the filter detects that answering the user's question requires the result of an external query or task to be provided to the bot as context, it raises a flag for your code to perform the task, get the result, and include it as context.

Then follow the same workflow from my previous post, and don't forget to include the freshly generated context in the prompt so that the bot has both facts and conclusions.

Don't fall into the trap of ML people saying (imagining) that AI can solve everything in one shot. Even the best AI out there will always be just a "neural" response to a trigger unless there is a controller (another AI network or a simple algorithm) that decides what to do next and why. Even insects make decisions and plan, which implies multi-step thinking.


Thank you @sergeliatko for the input.

Following your guidance, I tested several implementations based on the Embeddings docs and sample code to determine which is best suited for my use case. The setup below appears to work well.

1: Prepare training data with prompt, completion, and some other columns in a CSV.

  • Having tested training data with variant prompts/completions, I don't think I need many variants, since the cosine similarity score handles that in my case of very short prompts.

2: Add a 'babbage_similarity' column using engine='text-similarity-babbage-001' and a 'babbage_search' column using engine='text-search-babbage-doc-001' against the 'prompt' column.

  • I tried a combined column concatenating 'prompt' and 'completion', but the prompts tend to be very short, so using 'prompt' alone works better in my case.

3: When a query comes in, call engine="text-search-babbage-query-001" to calculate a 'similarities' score.
4: Sort by the 'similarities' score and select the top 3 entries with values above 0.35. Otherwise, leave the context empty.
5: If there is a match, build the following prompt and call the completion API with engine="text-davinci-002". If there is no match, make <BASE_EMBEDDED_DESCRIPTIONS> empty.

The prompt format looks like this:


<BASE_EMBEDDED_DESCRIPTIONS> (each prompt/completion separated by '###')
USER: qqqqqq ->
BOT: aaaaaa ###  \n\n



    response = openai.Completion.create(
        engine=CHATBOT_MODEL,  # "text-davinci-002"
        prompt=BOT_PREFIX + aprompt,
        stop=["###", " ->"],
    )

So far, the cases I have tested work as expected: almost all of the given facts appear in the response. For questions not in the training data, "text-davinci-002" responds nicely. Without caching, the turnaround time for the above process is around 3-4 seconds; with the cache enabled, similar questions are answered within a few milliseconds.
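Steps 3-5 above can be condensed into a small selection function. A sketch, where `rows` stands in for the CSV rows and the similarity scores are made-up values (in the real pipeline they come from the text-search-babbage query embedding):

```python
THRESHOLD = 0.35  # minimum similarity to count as a match (step 4)
TOP_N = 3         # number of context entries to keep (step 4)

def select_context(rows):
    # rows: list of (prompt, completion, similarity) tuples.
    matches = sorted((r for r in rows if r[2] > THRESHOLD),
                     key=lambda r: r[2], reverse=True)[:TOP_N]
    # Step 5: join the matches into <BASE_EMBEDDED_DESCRIPTIONS>;
    # an empty string if nothing clears the threshold.
    return " ### ".join(f"USER: {p} ->\nBOT: {c}" for p, c, _ in matches)

rows = [
    ("Tell me about BCD", "BCD belongs to CyberSecurity.", 0.52),
    ("Tell me about ABC", "ABC belongs to Web3.", 0.41),
    ("Tell me about XYZ", "XYZ belongs to Retail.", 0.20),  # below cut-off
]
base_embedded_descriptions = select_context(rows)
```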

The next challenges are:

  • Embedding file size and processing time

    • I used 1K samples in the experiment above. The next step is to try 300K samples.
    • The 1K-sample file is approximately 200KB, and after embedding it becomes 100MB. I am now preparing 300K samples for pre-processing, which has already taken over two hours. Assuming linear growth, the finished file may be around 30GB, which will probably cause performance problems. In that case, I may need to store the data in a database rather than a file.
  • Embedding engine
    Following the sample code, I picked an embedding model but haven't tested the others. To select the best one for my use case, I need to understand the differences between them.


Small update.
I was running the 300K-record (100MB) sample training data through embedding, but after a few hours the OS threw an OOM error. (The dev server has 16GB of memory and 4GB of swap.) I reduced the sample to 130K records (33MB file) to see whether it can handle that amount, and to check the performance of the pre-embedding search.

Thanks @lmccallum @sergeliatko for your helpful insight.

My understanding is that a Transformer-based model processes full sequences in parallel, and attention is a matrix of weighted contributions of each input element to each output element. So when a fine-tuned model is created from a certain amount of training data (on top of unsupervised pre-training), learning occurs in such a way that the model predicts the next word.

For example, the next word after the sequence "a", "robot", "must" is learned without knowledge of what the next word is. If the training data were taken into account with sufficient weight, the model should return the word provided by the training data.

Unfortunately, that is not the case for me so far. What I still don't understand is what the fine-tuned model is for, if it doesn't contribute to learning the correspondence between input strings and output strings. I understand that the base model holds a huge amount of data, so the fine-tuning data has less impact on selecting the next word. But then what is the use case for a fine-tuned model, considering that embedded context takes priority?

1 Like

What I don't get is why you would embed your prompts. What benefit are you looking for?

I would just embed my facts to be able to find the context for the bot to answer based on.

Then I would run my app for a while with the ability to edit the bot's reply before a human validates it and it is saved to my training JSONL file ({"prompt": text, "completion": text}).

Once I had more than 400 replies saved, I would start fine-tuning the model on the training file.

Every 500 replies, I would fine-tune a new version of the previous model on the new data with n_epochs 2.

I really don’t see the point of embedding my prompts in this scenario unless I’m missing something.
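The collection step in that workflow might look like this; the JSONL record shape and the 400-reply trigger are taken from the post, while the file name and function names are illustrative:

```python
import json
import os

TRAINING_FILE = "training.jsonl"
FINE_TUNE_AT = 400  # start fine-tuning once this many replies are saved

def save_validated_reply(prompt, completion, path=TRAINING_FILE):
    # Append one human-validated {prompt, completion} pair as a JSONL line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

def ready_to_fine_tune(path=TRAINING_FILE):
    # True once the file has accumulated enough validated examples.
    if not os.path.exists(path):
        return False
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f) >= FINE_TUNE_AT
```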

1 Like

What do your facts look like? Any example?

I bet your facts could be embedded with text-search-ada-doc ahead of time to give you nice context, and with text-search-ada-query for the user questions live as they arrive.

Hi cavalierski

I am something of a prompt wizard. I have some observations that might help.

  1. Yes, if you want the model to respond with its strongest correlations (i.e., your "truth"), then reduce the temperature to 0. That is the "truthiest" it can get.
  2. Fine-tuning does not work so well here; 1000 examples is not enough, 10k is more like what you need. When you say it is giving non-factual answers, strictly speaking that is impossible: "truth" in this context is whatever it has been trained to say, so it cannot lie. Do you mean it is not saying things it should have been trained to say? Then your training data has contradictions in it. It says X is both Y and Z, and that is the cause of the variation in answers. Again, see points 1-2. Or…
  3. You either need to block any questions on topics you do not want to or cannot talk about well, or
  4. You might get better results if you use a davinci-002 prompt for the truthful chatting (I can show you how to prompt for this) with a semantic-search dynamic prompt over your company database info.
  5. It might want to ask clarifying questions as well.

In essence, you need to make something more self-aware, which is what I have done.

Half of the prompt issues I see are from people who are trying to make an AI partially self-aware: in your case, self-aware enough to know truth from falsity, or to ask clarifying questions if it does not fully appreciate the nuances of the question. This is possible, but it requires nuanced prompting, assuming self-awareness is possible and then building it from nothing.

So you need to make it either more self-aware (knowing truth, knowing what was said, comparing them, deciding what to say, saying it) or less (rejecting any question it does not have a very truthful answer for at the ready).
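The "less self-aware" option (rejecting questions the bot cannot answer well) can be approximated with a simple gate in front of the model. A sketch; the company list and refusal text are invented for illustration:

```python
# Only let through questions that mention a company the bot knows about;
# refuse everything else before spending an API call.
KNOWN_COMPANIES = {"ABC", "BCD"}
REFUSAL = "Sorry, I am not sure. I will learn more to support you."

def gate(question):
    if any(name in question for name in KNOWN_COMPANIES):
        return None    # pass the question through to the model
    return REFUSAL     # reject: no truthful answer at the ready
```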

note: when I say self-aware, perhaps, so as not to incur the wrath of those who hold the term in magical regard, let's say artificially, analogously, or essentially "self-aware", as I tire greatly of trying to educate those of dull dreams or distinctions :slight_smile:

Hope that helps! If you need any more help, I am at your disposal; do not hesitate to ask!


Well, 250 high-quality examples of how to detect a hidden clause title inside a paragraph of a legal document and extract it were enough to get great results with a fine-tuned davinci. In French.

So 10k examples for a bot might be overkill. Again, it depends on the actual data itself.