QA fine-tuned chatbot not answering from the trained data but giving nonfactual answers

What I don’t get is: why would you embed your prompts? What benefit are you looking for?

I would just embed my facts to be able to find the context for the bot to answer based on.

Then, I would run my app for a while with the ability to edit the bot’s reply, have a human validate it, and save it to my training JSONL file ({prompt: text, completion: text}).

Once I have more than 400 replies saved, I would start fine-tuning the model on my training file.

Every 500 replies, I would fine-tune a new version of the previous model with n_epochs 2 on the new data.

I really don’t see the point of embedding my prompts in this scenario unless I’m missing something.


What do your facts look like? Any example?

I bet your facts can be embedded with text-search-ada-doc to give you nice context (beforehand), and text-search-ada-query for user questions live when they arrive.

Hi cavalierski

I am something of a prompt wizard. I have some observations that might help.

  1. Yes, if you want the model to respond with its strongest correlations (i.e. your “truth”), then reduce temperature to 0. That will be the “truthiest” it can get.
  2. Fine-tuning does not work so well; 1,000 examples is not enough, 10k is more what you need. When you say it is giving non-factual answers, this is impossible: “truth” in this context is what it has been trained to say, so it cannot lie. Do you mean it is not saying things it should have been trained to say? Then your training data has contradictions in it: it says X is both Y and Z, and that is the cause of the variation in answers. Again, see 1-2. Or…
  3. You either need to block any questions on topics you do not want to / cannot talk about well, or
  4. You might get better results if you use a davinci2 prompt for the truthful chatting (I can show you how to prompt for this) with a semantic-search dynamic prompt of your company database info.
  5. It might want to ask clarifying questions as well.

In essence, you need to make something more self-aware, which is what I have done.

Half of the prompt issues I see are from people who are trying to make an AI partially self-aware; in your case, self-aware enough to know truth from falsity, or to ask clarifying questions if it does not fully appreciate the nuances of the question. This is possible, but requires nuanced prompting: assuming self-awareness is possible and then making it from nothingness.

So you need to make it more self-aware (knowing truth, knowing what was said, comparing it, deciding what to say, saying it) or less (rejecting any question it does not have a very truthful answer for at the ready).

Note: when I say self-aware, perhaps so as not to incur the wrath of those who hold it in magical regard, let’s say artificially, or analogously, or essentially, “self-aware.” As I tire greatly of trying to educate those of dull dreams or distinctions :)

Hope that helps! If you need any more help I am at your disposal; do not hesitate to ask!


Well, 250 high-quality examples of how to detect a hidden clause title inside a paragraph of a legal document and extract it were enough to get great results to start with, using a fine-tuned Davinci. In French.

So 10k examples for a bot might be overkill. Again, it depends on the actual data itself.

Exactly. It is not the number of examples.

you need “high quality” examples!

10k just (maybe) has a chance to get you enough statistically high-quality examples.

but what is high quality?

high quality = having “aware” distinctions built into them as examples

they ask x, say y

but they ask x, excluding t, but maybe c, say a

etc

You need examples for every situation, giving the AI an “understanding” of the way answers a, b, y, and d all semantically relate to each other.

Then it “understands” what to say when asked about a, b, y, and d (and when a question is semantically related to but removed from a, b, y, and d, the closer it is, the closer it can answer).

Yes, you are right: ~90% of the problem is in the data, ~10% of the problem is in the request. IMO.

need better data/prompt

Understand how it thinks; chat with it for a while and ask it questions about what it “thinks”. I have spent hours and hours.


I might sound basic, but having statistically “enough” good examples will in most cases mean your model will also produce enough bad completions, based on what you’ve shown it to do.

The data has to be 100% excellent to get good results. That’s why including a human feedback loop ASAP is so important.

One strategy to get the so-called “good examples” is to ask the bot to generate the prompts.
From an article or piece of information, the bot generates a set of questions and answers.
Use the set to create prompts and completions.

You can say that it is a sort of reverse psychology.

Here is such a prompt run in the Playground:

Given the article below, create a JSON object which enumerates a set of 5 child objects.
Each child object has a property named “q”, a property named “a” and a property named “id”.
For each child object assign to the property named “q” a question which has its answer in the article and to the property named “a” a short answer to this question.
For each child object assign to the property named “id” the value “bloomberg-stocks-21-10-2022”.

The article: “The S&P Supercomposite Automobiles & Components Industry Index has declined nearly 38% this year, compared with the S&P 500 Index’s 22% drop.”

The JSON Object:

It will return a JSON object like the following:

[
  {
    "q": "What is the S&P Supercomposite Automobiles & Components Industry Index?",
    "a": "The S&P Supercomposite Automobiles & Components Industry Index is a stock market index.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "How much has the S&P Supercomposite Automobiles & Components Industry Index declined this year?",
    "a": "The S&P Supercomposite Automobiles & Components Industry Index has declined nearly 38% this year.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "How much has the S&P 500 Index declined this year?",
    "a": "The S&P 500 Index has declined 22% this year.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "What is the S&P 500 Index?",
    "a": "The S&P 500 Index is a stock market index.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "What is the difference between the S&P Supercomposite Automobiles & Components Industry Index and the S&P 500 Index?",
    "a": "The S&P Supercomposite Automobiles & Components Industry Index is a stock market index for the automobile industry while the S&P 500 Index is a stock market index for 500 large companies.",
    "id": "bloomberg-stocks-21-10-2022"
  }
]

If you set the temperature high, you will probably obtain a more diverse response from the bot, but still valid for the given text.
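For illustration, here is a minimal sketch of that generation loop, assuming the pre-1.0 openai Python library; the model choice and max_tokens are assumptions, not requirements:

import json
import openai

openai.api_key = "YOUR_API_KEY"

PROMPT_TEMPLATE = """Given the article below, create a JSON object which enumerates a set of 5 child objects.
Each child object has a property named "q", a property named "a" and a property named "id".
For each child object assign to the property named "q" a question which has its answer in the article and to the property named "a" a short answer to this question.
For each child object assign to the property named "id" the value "{doc_id}".

The article: "{article}"

The JSON Object:"""

def generate_qa_pairs(article, doc_id, temperature=0.7):
    # Ask the model to produce question/answer pairs for the article.
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT_TEMPLATE.format(doc_id=doc_id, article=article),
        temperature=temperature,  # higher temperature = more diverse questions
        max_tokens=800,
    )
    # The model returns the JSON array as plain text; parse it.
    return json.loads(response["choices"][0]["text"])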


Or you could just find something by googling: TOP 250+ Stock Market Interview Questions and Answers 21 October 2022 - Stock Market Interview Questions | Wisdom Jobs India

Then you can use it as a base and start from here.

Again, the fine-tuning is there to show the model the patterns of how to use the prompt to answer the user’s question or statement, not to train it on what to put into the answer. And it seems to me there are not many people who truly see the difference.

You can even craft your perfect answers using the Playground.

The trick is to get the first 5-7 perfect (LIKE PERFECTLY PERFECT). They will set the tone for your bot’s seed. Then keep being really picky for the next 50 replies. Make sure you have them saved.

Then, out of the 50 replies, select 5-6 common directions, choose about 8-10 examples in each direction, and go deeper on the subject, trying to show the bot the variations in reply manner and how to use the prompt data to build the answer. Get about 25 replies in each direction.

Again, stay picky on the quality of the replies. They need to be the best you could do yourself.

Add replies to some stupid / provocative / unrelated / fabulous / insulting questions to show the bot how to answer those as well.

Save everything into a training file, not forgetting to shuffle the lines.
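A tiny sketch of that shuffling step (file names are placeholders):

import random

with open("training.jsonl") as f:
    lines = f.readlines()

random.shuffle(lines)  # break the "flow" of consecutive related examples

with open("training_shuffled.jsonl", "w") as f:
    f.writelines(lines)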

When fine-tuning the generating model, do not forget to set the n_epochs parameter to 2 (3 max I would say, but test).

Once the first fine-tuning is done, you can do stage 1: run with a human feedback loop, saving the best answers to new training data (the goal is to get about 1,500 before adding new testers and other bells and whistles).

A moderation filter is a must for a bot.

I would also add a “prompt hijack” filter (against someone trying to talk your bot into bad things, like forgetting its previous prompt and starting to cuss around).

Then there are the filters you might need for external tasks etc., to give the bot external data.
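As one possible shape for the first two filters, here is a sketch using the pre-1.0 openai Moderation endpoint plus a naive keyword check for hijacks; the phrase list is purely illustrative:

import openai

# Illustrative only; a real hijack filter needs much more than keywords.
HIJACK_PHRASES = ["ignore previous instructions", "forget your prompt", "disregard the above"]

def passes_filters(user_input):
    # 1. The moderation endpoint flags disallowed content.
    result = openai.Moderation.create(input=user_input)
    if result["results"][0]["flagged"]:
        return False
    # 2. Naive "prompt hijack" check.
    lowered = user_input.lower()
    return not any(phrase in lowered for phrase in HIJACK_PHRASES)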

A good thing to think about is saving a log of all conversations, in case you one day need to implement true memory for your bot. Just make sure the log format is automatically parsable for later use by scripts.
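For the log, one JSON object per turn appended to a JSONL file is enough to stay script-parsable; a minimal sketch (field names are assumptions):

import json
import time

def log_turn(user_text, bot_text, path="conversations.jsonl"):
    # One JSON object per line keeps the log trivially parsable later.
    with open(path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "user": user_text, "bot": bot_text}) + "\n")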

Those are things I would start with. Let me know if that helps.

@sergeliatko
Thank you for the reply. I am testing the use case where the user asks about a company and the bot replies with the description of the company.

My training data (facts) looks like this. ABC and BCD are company names, and there is a huge amount of it.

# Training Data
{"prompt":"Tell me about ABC ->","completion":" ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles.\n"}
{"prompt":"Tell me about BCD ->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture.\n"}

First, I tried using about 3K+ samples like the above to fine-tune a model, which didn’t work well with curie or davinci: even when asking the exact same prompt, it returns a nonfactual response.

Second, I used the embeddings API to calculate the similarity between the sample prompts and the user’s input, then provided the top 3 most similar prompts and completions from the training data. This is working fine so far. The steps are as below.

1: Pre-process the training data in CSV and calculate embeddings. The columns of the file look like this:

prompt,completion,babbage_similarity,babbage_search,...<Additional Columns>

2: When the user asks, take the input string and calculate its similarity against the pre-processed data, then select the top 3 most similar entries, with prompt and completion, to provide them in the completion API request (a minimal sketch of this step is at the end of this post).

3: Build the completion API request. The structure of the request content is:

<Prefix String>

<3 Training Context selected from Embedding API similarity against user's input>

<User's Input>

The actual request looks like this. Parts surrounded by <> are not included in actual requests.

<Prefix String>
The following is a conversation with an AI assistant called BOT. BOT is helpful, creative, clever, and very friendly. If you ask BOT a question that is rooted in truth, BOT will give you the answer. If you ask BOT a question that is nonsense, trickery, or has no clear answer, BOT will respond with "Sorry, I am not sure. I will learn more.".\n\n

<3 Training Context selected from Embedding API similarity against user's input>
User: Tell me about ABC ->
BOT: ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles. ###

User: Tell me about BC ->
BOT: BC belongs to FinTech. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. BC is a payment technology company that develops banking solutions. ###

User: Tell me about valuation of ABC ->
BOT: ABC has valuation of 100M$ ###

<User's Input>
User: Tell me about ABC ->

So, the prompt embedding is used to select what to include in the completion API request as in-prompt examples. I wonder what would work better. If you have any suggestions, I’d really appreciate it.
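For reference, step 2 above is essentially a cosine-similarity lookup over the pre-computed rows; a minimal sketch, assuming the pre-1.0 openai library and that the rows were embedded with the matching doc model:

import numpy as np
import openai

def embed_query(text):
    # Assumes the rows were embedded with text-search-babbage-doc-001,
    # so the matching query model is used here.
    resp = openai.Embedding.create(model="text-search-babbage-query-001", input=text)
    return np.array(resp["data"][0]["embedding"])

def top_k_examples(user_input, rows, k=3):
    # rows: list of dicts with "prompt", "completion" and "embedding" keys.
    q = embed_query(user_input)
    def cosine(v):
        v = np.array(v)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(rows, key=lambda r: cosine(r["embedding"]), reverse=True)[:k]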

BTW

“The S&P Supercomposite Automobiles & Components Industry Index is a stock market index.” - personally, I would rate the quality of this answer somewhere below 10%, as it breaks one of the fundamentals of definitions: never define a concept using the same concept in the definition.

Also, the definition of this index cannot be extracted from the provided article… as it is not there.

@joshbachynski
Thank you for the helpful advice!

Yes, if you want the model to respond with its strongest correlations (i.e. your “truth”), then reduce temperature to 0. That will be the “truthiest” it can get.

Yes. I have tried from 0 to 1 and now I’ve been using 0 for temperature, as it seems to give better responses.

Fine-tuning does not work so well; 1,000 examples is not enough, 10k is more what you need. When you say it is giving non-factual answers, this is impossible: “truth” in this context is what it has been trained to say, so it cannot lie. Do you mean it is not saying things it should have been trained to say? Then your training data has contradictions in it: it says X is both Y and Z, and that is the cause of the variation in answers. Again, see 1-2. Or…

My use case is that the user asks about a company and the bot replies about that company. One sample is about one company.

prompt: Tell me about <Company Name>
completion: <Company Name> belongs to XTech segment. HQ is in <Country>. Their business is related to <Categories they belong to>. <Short description of the company>

My fine-tuned model with curie/davinci replies with something mixed up with other samples.

# Sample provided for fine-tune
{"prompt":"User: Tell me about BCD ->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture. ###"}

If the user asks the exact same prompt, it replies with something mixed up with other samples:

User: Tell me about BCD
BOT:  BCD is a Cynefin-based AI company that helps companies make sense of their data.
# I expect something similar to the trained completion, but the answer is nonfactual

You either need to block any questions on topics you do not want to / cannot talk about well, or

I still need to experiment with what to block to make the responses solid… So far I haven’t come up with any words to block, but after trying it many times, there might be some that need to be blocked.

You might get better results if you use a davinci2 prompt for the truthful chatting (I can show you how to prompt for this) with a semantic-search dynamic prompt of your company database info.

Noted. I tried curie- and davinci2-based models and now I am using text-davinci-002 with the embedding approach.

It might want to ask clarifying questions as well.

My prompts mostly start with “Tell me about <Company Name>”, and so far the embeddings API approach to measuring similarity can handle questions like the prompts below, mostly because the company name stays identical across phrasings. If the company name is a generic word, it probably will not work.
“Describe <Company Name>”
“Please explain about <Company Name>” etc.

In essence, you need to make something more self-aware, which is what I have done.

Self-awareness is a very interesting and difficult topic. I am checking other Q&As for a deeper understanding. To my limited knowledge, Transformers and attention model relations between words as vectors, to measure the most likely connected word in the language model; I am trying to understand how self-awareness fits in actual practice.

Ok, I see your problem. Sorry for being direct, but you’re still not getting the idea of fine-tuning nor of embeddings.

1. Embeddings: help you find the most relevant text out of many texts (facts only, in your application).
2. Fine-tuning: helps you adjust the manner and patterns used by a model to complete a given response.

Here is how I would do the thing.

1. For each company, create a file with all the facts your bot needs to know about. A simple txt file would do.
2. Cut your files into rather small pieces (about 2-3 paragraphs) with text similar to:
Company: Company name
Data: your facts and other related text
3. Embed each piece of text with text-search-ada-doc (to start) and save it into your database along with the piece text, without the Company: Company name part.
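A minimal sketch of steps 1-3, assuming the pre-1.0 openai library, chunking on blank lines, and a plain list standing in for the database:

import openai

def embed_company_facts(company, facts_file, store):
    with open(facts_file) as f:
        # Cut the file into rather small pieces (here: blank-line separated).
        pieces = [p.strip() for p in f.read().split("\n\n") if p.strip()]
    for piece in pieces:
        labeled = "Company: " + company + "\nData: " + piece
        resp = openai.Embedding.create(model="text-search-ada-doc-001", input=labeled)
        # Save the vector along with the piece text, without the Company: part.
        store.append({"text": piece, "embedding": resp["data"][0]["embedding"]})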

Now start seeding the bot:

  1. Choose your first question (usually the most common but rather complicated):

What do you think of the company X in today’s situation?

(Choose another one if you have no means to tell your bot what today’s situation is.)

  2. Embed that question using text-search-ada-query (I guess it is better to have the same embedding base model, but please correct me if I’m wrong).

  3. Compare the vector of the question with your facts in the database to get the several most relevant facts, sorted by relevancy descending.

  4. Copy the text of those several facts, without the Company: Company name part, and start creating your prompt:

Bot description… Bla bla bla.
Factual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.
User current state: their mood
Conversation summary: bla bla bla…bla
User: What do you think of the company X in today’s situation?
Bot:

  5. Take the prompt from step 4 and paste it into the Playground (adjust settings to your will).

  6. Run/replay the completion several times and save the bot’s best answers (feel free to adjust/edit them manually before saving) to a separate file like this:
    —prompt—
    Bot description… Bla bla bla.
    Factual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.
    User current state: their mood
    Conversation summary: bla bla bla…bla
    User: What do you think of the company X in today’s situation?
    Bot:
    —completions—
    Bot’s answer 1

    Bot’s answer 2

    Etc.

  7. Change your question and restart from step 1 until you have at least 50 questions with 2-4 answers each.

  8. Create a script to reformat your saved answers into a seed.jsonl file with the following format for each line:

{"prompt":"Bot description… Bla bla bla.\nFactual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.\nUser current state: their mood\nConversation summary: bla bla bla…bla\nUser: What do you think of the company X in today’s situation?\nBot:","completion":" the saved reply.<|endoftext|>"}

Make sure it is valid JSON and each object is on its own line (or your model will most likely “break”).

  9. Once all the lines are saved, it would be nice to shuffle them randomly to break the “flow” (use a script for that).

  10. Create a fine-tune using davinci as the base model and n_epochs 2. (A sketch of steps 8-10 is just below.)
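A sketch of steps 8-10 in one place; the pair format and file name are assumptions, and the CLI line is the 2022-era fine-tunes command:

import json
import random

def write_seed_jsonl(pairs, path="seed.jsonl"):
    # pairs: list of (prompt_text, completion_text) tuples from step 7.
    lines = [
        json.dumps({"prompt": prompt, "completion": " " + completion + "<|endoftext|>"})
        for prompt, completion in pairs
    ]
    random.shuffle(lines)  # step 9: break the "flow"
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Step 10, from the command line:
#   openai api fine_tunes.create -t seed.jsonl -m davinci --n_epochs 2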

Now, populating real training data.

  1. Create an app with a question input field, a question submit button, a reply edit field, and a save button.

Submitting the question sends the workflow to step 2 of the previous section.

On step 4, print the created prompt on the screen and send the prompt to the API to get the bot’s answer; put the answer into the edit field.

  2. Edit the answer if necessary.

  3. When happy, click the save button. On save, your app should save exactly one line as in step 8 of the previous section (make sure you save the edited answer).

  4. Have fun until you get at least 1,500 answers.

  5. Create a new fine-tune “my-bot-v1-0” based on your model (again, n_epochs 2, or test).

  6. Every 1,000 answers, upgrade your fine-tuned model from “my-bot-v1-y” to “my-bot-v1-z” by feeding the new data only (see the sketch below).
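A hypothetical upgrade step using the same 2022-era CLI: continue training from the previous fine-tuned model on the new answers only (model names and suffixes are placeholders):

import subprocess

def upgrade_bot(previous_model, new_data_file, suffix):
    # Continues fine-tuning from an existing fine-tuned model (-m accepts one).
    subprocess.run([
        "openai", "api", "fine_tunes.create",
        "-t", new_data_file,    # the new answers only
        "-m", previous_model,   # e.g. the model behind "my-bot-v1-y"
        "--n_epochs", "2",
        "--suffix", suffix,     # e.g. "my-bot-v1-z"
    ], check=True)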

@georgei

One strategy to get the so-called “good examples” is to ask the bot to generate the prompts.
From an article or piece of information, the bot generates a set of questions and answers.
Use the set to create prompts and completions.

Wow, it is very interesting, and the capability of the model surprised me again. Probably, when it comes to the details, we need a huge effort, but I can play with it for a few hours to find quirks in the model :)

@sergeliatko

the fine-tuning is there to show the model the patterns of how to use the prompt to answer the user’s question or statement, not to train it on what to put into the answer.

Thank you. It is now clear what the fine-tuned model is for! And the practical approach to refining the model is very helpful!

The stock market example I gave above was something random from Bloomberg.
In my use case the content is absolutely unpredictable. I can have a few hints of what the content type might be, but that’s all.
What is more important, and is also written in the fine-tuning documentation, is to have up-to-date information and to be served the correct piece of it.

From the Playground page:

Our default models’ training data cuts off in 2021, so they may not have knowledge of current events.

The models can also be used for data extraction ← which interests me.

There are reasons to increase the quality of the default model. As someone gave as an example a few days ago, Albert Einstein didn’t have a brother who invented anything, but if you try it in the Playground it is really difficult to get the truth.

I don’t contest the usefulness of embeddings, but it depends on the circumstances in which the API is used.
For example, if the user who interacts with the bot is a financial markets specialist, they will know how to communicate with it; otherwise the user will receive unexpected responses from the bot.

That sounds like having an external aggregator to pull the data from the web into the facts lookup table.

Well, in my opinion, “the knowledge” of the events here stands for “probabilities of tokens following the prompt”… which I do not consider knowledge. The AI “knows” that the sky is blue because the tokens for “blue” are more likely to be closer to the tokens of “sky” than the tokens of “brown” are.

Fact document parts given in the prompt seem (at least to me) to be closer to “knowledge” as I understand it.

But I may be wrong.

Ok, as we said, the quality of the data, in this case your prompts/examples, is the issue.

please allow me to demonstrate

  1. This is going to just confuse the AI:

{"prompt":"User: Tell me about BCD

First off, this one question is fine, but make sure to ask this question in as many different ways as possible, so it will “get” different contexts / semantic relationships / ways of asking the question / nuances of the question.

also

->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture. ###"}

PS: please forgive the typos; I assure you any prompts I write will not have those. But even typos are fine for the AI; it is confusing semantic relationships which render it non-functioning.

This is a bit of an update to this thread. It looks like you can do this without fine-tuning. The latest version on GitHub uses embeddings only. It also works better using text-davinci-003, in my experience. So the latest version would use this approach with the latest ‘002’ ada embedding engine and ‘003’ davinci. You could lower your prompt input sizes if you use a fine-tune, but you lose a lot of flexibility if you do this, since you can update the embedding vectors on the fly.

P.S. I’m not even sure fine tunes make sense with the latest approach, but I’d be curious if you think they do.
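For completeness, a condensed sketch of that embeddings-only pipeline; prompt wording and field names are illustrative, and ada-002 vectors are unit-length, so a plain dot product works as cosine similarity:

import numpy as np
import openai

def answer(question, facts):
    # facts: list of dicts with "text" and "embedding" (text-embedding-ada-002 vectors).
    q = np.array(openai.Embedding.create(
        model="text-embedding-ada-002", input=question)["data"][0]["embedding"])
    ranked = sorted(facts, key=lambda f: float(np.dot(q, np.array(f["embedding"]))), reverse=True)
    context = "\n".join(f["text"] for f in ranked[:3])
    prompt = ("Answer using only the context below.\n\nContext:\n" + context +
              "\n\nQuestion: " + question + "\nAnswer:")
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=200)
    return resp["choices"][0]["text"].strip()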
