QA fine-tuned chatbot not answering from the training data, giving nonfactual responses instead

One strategy to get the so called “good examples” is to ask the bot to generate the prompts.
From an article or piece of information, the bot generates a set of questions and answers.
Use the set to create prompts and completions.

You can say that it is a sort of reverse psychology.

Here is such a prompt run in the Playground:

Given the article below, create a JSON object which enumerates a set of 5 child objects.
Each child object has a property named “q”, a property named “a” and a property named “id”.
For each child object assign to the property named “q” a question which has its answer in the article and to the property named “a” a short answer to this question.
For each child object assign to the property named “id” the value “bloomberg-stocks-21-10-2022”.

The article: “The S&P Supercomposite Automobiles & Components Industry Index has declined nearly 38% this year, compared with the S&P 500 Index’s 22% drop.”

The JSON Object:

It will return a JSON object like the following:

[
  {
    "q": "What is the S&P Supercomposite Automobiles & Components Industry Index?",
    "a": "The S&P Supercomposite Automobiles & Components Industry Index is a stock market index.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "How much has the S&P Supercomposite Automobiles & Components Industry Index declined this year?",
    "a": "The S&P Supercomposite Automobiles & Components Industry Index has declined nearly 38% this year.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "How much has the S&P 500 Index declined this year?",
    "a": "The S&P 500 Index has declined 22% this year.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "What is the S&P 500 Index?",
    "a": "The S&P 500 Index is a stock market index.",
    "id": "bloomberg-stocks-21-10-2022"
  },
  {
    "q": "What is the difference between the S&P Supercomposite Automobiles & Components Industry Index and the S&P 500 Index?",
    "a": "The S&P Supercomposite Automobiles & Components Industry Index is a stock market index for the automobile industry while the S&P 500 Index is a stock market index for 500 large companies.",
    "id": "bloomberg-stocks-21-10-2022"
  }
]

If you set the temperature high, you will probably obtain a more diverse response from the bot, but still one that is valid for the given text.
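
If you want to automate this instead of pasting into the Playground, here is a minimal sketch of the same idea against the API. It assumes the pre-1.0 openai Python package and text-davinci-002; the exact prompt wording, key and file names are placeholders.

# Sketch: generate Q&A pairs from an article; each pair can later become one
# {"prompt": ..., "completion": ...} training line.
import json
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

article = (
    "The S&P Supercomposite Automobiles & Components Industry Index has declined "
    "nearly 38% this year, compared with the S&P 500 Index's 22% drop."
)

prompt = (
    "Given the article below, create a JSON array of 5 objects. "
    'Each object has a property "q", a property "a" and a property "id". '
    'Assign to "q" a question answered in the article, to "a" a short answer, '
    'and to "id" the value "bloomberg-stocks-21-10-2022".\n\n'
    f'The article: "{article}"\n\nThe JSON array:'
)

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=0.7,  # raise for more diverse questions, lower for more literal ones
    max_tokens=512,
)

for pair in json.loads(response["choices"][0]["text"]):
    print(pair["q"], "->", pair["a"])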


Or you could just find something by googling: TOP 250+ Stock Market Interview Questions and Answers 21 October 2022 - Stock Market Interview Questions | Wisdom Jobs India

Then you can use it as a base and start from here.

Again, fine-tuning is there to show the model the patterns for how to use the prompt to answer the user's question or statement, not to train it on what to put into the answer. And it seems to me that not many people truly see the difference.

You can even craft your perfect answers using the Playground.

The trick is to get the first 5-7 replies perfect (like, perfectly perfect). They set the tone of your bot's seed. Then keep being really picky for the next 50 replies. Make sure you save them.

Then, out of those 50 replies, select 5-6 common directions, choose about 8-10 examples in each direction, and go deeper on the subject, trying to show the bot the variations in reply manner and how to use the prompt data to build the answer. Get about 25 replies in each direction.

Again, stay picky about the quality of the replies. They need to be the best you could do yourself.

Add replies to some stupid / provocative / unrelated / made-up / insulting questions to show the bot how to answer those as well.

Save everything into a training file, not forgetting to shuffle the lines.

When fine-tuning the generating model, do not forget to set the n_epochs parameter to 2 (3 max, I would say, but test).
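
For reference, a minimal sketch of that fine-tune call with the pre-1.0 openai Python package (the key, file name and suffix are placeholders, and the suffix is optional):

# Sketch: upload the training file and start a fine-tune with n_epochs=2.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

upload = openai.File.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(
    training_file=upload["id"],
    model="davinci",
    n_epochs=2,          # 2 as suggested above; try 3 at most and compare
    suffix="my-bot-v0",  # placeholder name suffix
)
print(job["id"])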

Once the first fine-tuning is done, you can do stage 1: run with a human feedback loop, saving the best answers to new training data (the goal is to get about 1,500 before adding new testers and other bells and whistles).

Moderation filter is a must for a bot.

I would also add a “prompt hijack” filter (someone trying to talk your bot into bad things, like forgetting the previous prompt and starting to curse).

Then there are the filters you might need for external tasks, etc., to give the bot external data.
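
As a sketch of those two checks: the moderation endpoint call below is the real API of that time, while the "prompt hijack" heuristic is only a naive illustration; a real hijack filter would be its own classifier or prompt.

# Sketch: block a message if the moderation endpoint flags it, or if it looks
# like an attempt to override the bot's instructions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def is_blocked(user_message: str) -> bool:
    moderation = openai.Moderation.create(input=user_message)
    if moderation["results"][0]["flagged"]:
        return True
    # Naive hijack heuristic for illustration only.
    hijack_markers = ["ignore previous", "forget your instructions", "disregard the prompt"]
    return any(marker in user_message.lower() for marker in hijack_markers)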

A good thing to think about is saving a log of all conversations, in case you one day need to implement true memory for your bot. Just make sure the log format is automatically parsable for later use by scripts.

Those are things I would start with. Let me know if that helps.

@sergeliatko
Thank you for the reply. I am testing a use case where the user asks about a company and the bot replies with a description of the company.

My training data (facts) looks like this. ABC and BCD are company names, and there is a huge amount of it.

# Training Data
{"prompt":"Tell me about ABC ->","completion":" ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles.\n"}
{"prompt":"Tell me about BCD ->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture.\n"}

First, I tried using about 3K+ samples like the above to fine-tune a model, which didn't work well with curie or davinci: even when asking the exact same prompt, it returns a nonfactual response.

Second, I use the embeddings API to calculate similarity between the sample prompts and the user's input, then provide the top 3 most similar prompts and completions from the training data. This is working fine so far. The steps are as below.

1: Preprocess the training data in CSV and calculate embeddings via the embeddings API. The columns of the file look like this.

prompt,completion,babbage_similarity,babbage_search,...<Additional Columns>

2: When the user asks, take the input string and calculate similarity against the pre-processed data, then select the top 3 most similar entries, with prompt and completion, to provide in the completion API request.

3: Build the completion API request. The structure of the request content is:

<Prefix String>

<3 Training Context selected from Embedding API similarity against user's input>

<User's Input>

The actual request looks like the example below. Parts surrounded by <> are not included in actual requests.

<Prefix String>
The following is a conversation with an AI assistant called BOT. BOT is helpful, creative, clever, and very friendly. If you ask BOT a question that is rooted in truth, BOT will give you the answer. If you ask BOT a question that is nonsense, trickery, or has no clear answer, I will respond with "Sorry, I am not sure. I will learn more.".\n\n

<3 Training Context selected from Embedding API similarity against user's input>
User: Tell me about ABC ->
BOT: ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles. ###

User: Tell me about BC ->
BOT: BC belongs to FinTech. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. BC is a payment technology company that develops banking solutions. ###

User: Tell me about valuation of ABC ->
BOT: ABC has valuation of 100M$ ###

<User's Input>
User: Tell me about ABC ->

So, the prompt embeddings are used to select which training examples to include in the completion API request as context. I wonder what would work better. If you have any suggestions, I would really appreciate them.
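
For what it's worth, here is a minimal sketch of steps 2-3 as described, assuming the pre-1.0 openai package, that the babbage_similarity column stores the embedding vector as a list, and that the CSV file name and prefix string are placeholders:

# Sketch: embed the user's input, rank pre-computed rows by cosine similarity,
# take the top 3 and assemble the completion prompt.
import ast
import numpy as np
import openai
import pandas as pd

openai.api_key = "YOUR_API_KEY"  # placeholder

PREFIX = "The following is a conversation with an AI assistant called BOT. ..."  # the prefix string above

def embed(text, model="text-similarity-babbage-001"):
    resp = openai.Embedding.create(input=[text], model=model)
    return np.array(resp["data"][0]["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

df = pd.read_csv("training_with_embeddings.csv")  # placeholder file name
df["babbage_similarity"] = df["babbage_similarity"].apply(ast.literal_eval)

user_input = "Tell me about ABC ->"
query_vec = embed(user_input)
df["score"] = df["babbage_similarity"].apply(lambda v: cosine(query_vec, np.array(v)))
top3 = df.nlargest(3, "score")

context = "\n\n".join(f"User: {r.prompt}\nBOT:{r.completion} ###" for r in top3.itertuples())
request_prompt = f"{PREFIX}\n\n{context}\n\nUser: {user_input}\nBOT:"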

BTW

“The S&P Supercomposite Automobiles & Components Industry Index is a stock market index.” - personally, I would consider the quality of this answer somewhere below 10%, as it breaks one of the fundamentals of definitions: never define a concept using the concept itself in the definition.

Also, the definition of this index cannot be extracted from the provided article… as it is not there.

@joshbachynski
Thank you for the helpful advice!

Yes, if you want the model to respond with its strongest correlations (i.e., your “truth”), then reduce the temperature to 0. That will be the “truthiest” it can get.

Yes. I have tried from 0 to 1, and now I've been using 0 for temperature as it seems to give better responses.

Fine-tuning does not work so well here; 1,000 examples is not enough. 10k is more like what you need. When you say it is giving non-factual answers, that is impossible: “truth” in this context is what it has been trained to say, so it cannot lie. Do you mean it is not saying things it should have been trained to say? Then your training data has contradictions in it. It says X is both Y and Z, so that is the cause of the variation in answers. Again, see 1-2. Or…

My use case is that the user asks about a company and the bot replies about the company. One sample covers one company.

prompt: Tell me about <Company Name>
completion: <Company Name> belongs to XTech segment. HQ is in <Country>. Their business is related to <Categories they belong to>. <Short description of the company>

My fine-tuned model with curie or davinci replies with something mixed up with other samples.

# Sample provided for fine-tune
{"prompt":"User: Tell me about BCD ->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture. ###"}

If the user asks exactly the same prompt, it replies with something mixed up with other samples:

User: Tell me about BCD
BOT:  BCD is a Cynefin-based AI company that helps companies make sense of their data.
# I expect something similar to the completion it was trained on, but the answer is nonfactual

you either need to block any questions on topics you do not want to / cannot talk about well, or…

I still need to experiment with what to block to make the responses solid… So far I haven't come up with any words to block, but after trying it many times, there might be some that need to be blocked.

you might get better results if you use a davinci-002 prompt for the truthful chatting (I can show you how to prompt for this) with a dynamic prompt built from a semantic search of your company database info

Noted. I tried curie and davinci-002 based models, and now I am using text-davinci-002 with the embeddings approach.

it might want to ask clarifying questions as well

My prompts mostly start with “Tell me about ”, and so far the embeddings API approach to measuring similarity can handle prompts like the ones below, mostly because the company name matches; if the company name is a generic word, it probably will not work.
“Describe about ”
“Please explain about ”, etc.

In essence, you need to make something more self-aware, which is what I have done.

Self-awareness is a very interesting and difficult topic. I am checking other Q&As for a deeper understanding. With my limited knowledge, Transformers and attention model the relations between words as vectors to find the most likely connected word in the language model, and I am trying to understand how self-awareness fits in actual practice.

OK, I see your problem. Sorry for being direct, but you're still not getting the idea of fine-tuning, nor of embeddings.

1. Embeddings - help you find the most relevant text out of many texts (facts only, in your application)
2. Fine-tuning - helps you adjust the manner and patterns used by a model to complete a given prompt into a response.

Here is how I would do the thing.

1. For each company, create a file with all the facts your bot needs to know about it. A simple txt file would do.
2. Cut your files into rather small pieces (about 2-3 paragraphs) with text similar to:
Company: Company name
Data: your facts and other related text
3. Embed each piece of text with text-search-ada-doc (to start) and save it into your database along with the piece's text without the Company: Company name part.
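
A minimal sketch of those three steps, assuming the pre-1.0 openai package; the chunking rule, file layout and the in-memory "database" below are only stand-ins for whatever you actually use:

# Sketch: split a company file into small chunks, embed each chunk with the
# doc-search model, and store the vector next to the raw chunk text.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def chunk_paragraphs(text, per_chunk=3):
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i + per_chunk]) for i in range(0, len(paras), per_chunk)]

def embed_doc(text):
    resp = openai.Embedding.create(input=[text], model="text-search-ada-doc-001")
    return resp["data"][0]["embedding"]

facts_db = []    # stand-in for a real database table
company = "ABC"  # placeholder company name
with open(f"{company}.txt") as f:
    for piece in chunk_paragraphs(f.read()):
        facts_db.append({
            "company": company,
            "text": piece,  # stored without the "Company: ..." header
            "vector": embed_doc(f"Company: {company}\nData: {piece}"),
        })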

Now start seeding the bot:

  1. Choose your first question (usually the most common but rather complicated one):

What do you think of the company X in today’s situation?

(Choose another one if you have no means to tell your bot what today's situation is.)

  2. Embed that question using text-search-ada-query (I guess it is better to use the same embedding base model, but please correct me if I'm wrong).

  3. Compare the vector of the question with your facts in the database to get the several most relevant facts, sorted by relevancy descending.

  4. Copy the text of those several facts without the Company: Company name part, and start creating your prompt:

Bot description… Bla bla bla.
Factual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.
User current state: their mood
Conversation summary: bla bla bla…bla
User: What do you think of the company X in today’s situation?
Bot:

  5. Take the prompt from step 4 and paste it into the Playground (adjust settings to your liking).

  6. Run/replay the completion several times and save the bot's best answers (feel free to adjust/edit them manually before saving) to a separate file like this:
    —prompt—
    Bot description… Bla bla bla.
    Factual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.
    User current state: their mood
    Conversation summary: bla bla bla…bla
    User: What do you think of the company X in today’s situation?
    Bot:
    —completions—
    Bot’s answer 1

    Bot’s answer 2

    Etc.

  7. Change your question and restart from step 1 until you have at least 50 questions with 2-4 answers each.

  8. Create a script to reformat your saved answers into a seed.jsonl file with the following format for each line (see the sketch after this list):

{"prompt":"Bot description… Bla bla bla.\nFactual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.\nUser current state: their mood\nConversation summary: bla bla bla…bla\nUser: What do you think of the company X in today's situation?\nBot:","completion":" the saved reply.<|endoftext|>"}

Make sure it is valid JSON and each object is on its own line (or your model will most likely “break”).

  9. Once all the lines are saved, it would be nice to shuffle them randomly to break the “flow” (use a script for that).

  10. Create a fine-tune using davinci as the base model and n_epochs 2.
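
Here is a minimal sketch of the step 8-9 script; how you stored your Playground answers is up to you, so the saved_answers structure below is only an assumption:

# Sketch: turn saved (prompt, completion) pairs into seed.jsonl, one JSON
# object per line, then shuffle the lines to break the "flow".
import json
import random

saved_answers = [
    {"prompt": "Bot description...\nFactual context: ...\nUser: ...\nBot:",
     "completion": " the saved reply."},
    # ... more saved pairs
]

lines = [
    json.dumps({"prompt": item["prompt"],
                "completion": item["completion"] + "<|endoftext|>"})
    for item in saved_answers
]
random.shuffle(lines)  # step 9

with open("seed.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")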

Now, populating real training data.

  1. Create an app with a question input field, a question submit button, a reply edit field and a save button.

Submitting a question sends the workflow to step 2 of the previous section.

At step 4, print the created prompt on the screen and send it to the API to get the bot's answer, then put the answer into the reply edit field.

  2. Edit the answer if necessary.

  3. When happy, click the save button. On save, your app should append exactly one line, as in step 8 of the previous section (make sure you save the edited answer).

  4. Have fun until you get at least 1,500 answers.

  5. Create a new fine-tune “my-bot-v1-0” based on your model (again, n_epochs 2, or test).

  6. Every 1,000 answers, upgrade your fine-tuned model from “my-bot-v1-y” to “my-bot-v1-z” by feeding it the new data only.

@georgei

One strategy to get the so called “good examples” is to ask the bot to generate the prompts.
From an article or piece of information, the bot generates a set of questions and answers.
Use the set to create prompts and completions.

Wow, it is very interesting and surprised me with the capability of the model again. Probably, when it comes to the details, it will take a huge effort, but I can play with it for a few hours to find quirks in the model :smiley:

@sergeliatko

fine-tuning is there to show the model the patterns for how to use the prompt to answer the user's question or statement, not to train it on what to put into the answer.

Thank you. It is now clear what the fine-tuned model is for! And the practical approach to refining the model is very helpful!

The stock market example I gave above was something random from Bloomberg.
In my use case the content is absolutely unpredictable. I may have a few hints about what the content type might be, but that's all.
What is more important, and is also written in the fine-tuning documentation, is to have up-to-date information and to be served the correct piece of it.

From the Playground page:

Our default models’ training data cuts off in 2021, so they may not have knowledge of current events.

The models can also be used for data extraction ← which interests me.

There are reasons to increase the quality of the default model. As someone gave as an example a few days ago, Albert Einstein didn't have a brother who invented anything, but if you try it in the Playground it is really difficult to get the truth.

I don't contest the usefulness of embeddings, but it depends on the circumstances in which the API is used.
For example, a user who interacts with the bot and is a financial markets specialist will know how to communicate with it; otherwise, the user will receive unexpected responses from the bot.

That sounds like having an external aggregator to pull the data from the web into the facts lookup table.

Well, in my opinion, “the knowledge” of events here stands for “probabilities of tokens following the prompt”… which I do not consider knowledge. The AI “knows” that the sky is blue because the tokens for “blue” are more likely to be close to the tokens for “sky” than the tokens for “brown” are.

Parts of fact documents given in the prompt (at least to me) seem to be closer to “knowledge” as I understand it.

But I may be wrong.

OK, as we said, the quality of the data, in this case your prompts / examples, is what is at issue.

Please allow me to demonstrate.

  1. This is just going to confuse the AI:

{"prompt":"User: Tell me about BCD

First off, this one question is fine, but make sure to ask it in as many different ways as possible so the model will “get” different contexts / ways of asking the question / nuances of the question / semantic relationships.

also

->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture. ###"}

PS: Please forgive the typos; I assure you any prompts I write will not have those. But even typos are fine for the AI; it is confusing semantic relationships that render it non-functioning.

This is a bit of an update to this thread. It looks like you can do this without fine-tuning. The latest version on GitHub uses embeddings only. It also works better using text-davinci-003 in my experience. So the latest version would use this approach with the latest ‘002’ ada embedding engine and the ‘003’ davinci. You could lower your prompt input sizes if you use a fine-tune, but you lose a lot of flexibility if you do that, since with embeddings only you can update the vectors on the fly.

P.S. I'm not even sure fine-tunes make sense with the latest approach, but I'd be curious if you think they do.


Hi everyone,
Like many people here, I'm struggling to make the model respond using knowledge gained from my facts. As helpfully suggested by @sergeliatko, I tried embedding the facts, choosing the top 3 candidates, and sending them as context along with the prompt itself to text-davinci-003. It works… but the thing is… my facts are quite long, and I often exceed the 4,000-token limit imposed by the model. Since I can't fine-tune with my facts, and I can't send them all with the prompt, what is the alternative? Any thoughts?
Thanks a lot.


In a legal AI project I have right now, I face a similar problem: the context is too long to fit into the prompt (especially with fine-tuned models). What I'm doing is (depending on the nature of the fact) one of the following, or a combination of them:

0. Always ask myself whether it isn't easier to reformulate the query so that it doesn't need those long contexts (breaking the complex query into smaller parts, processing them separately using shorter context chunks, getting answers, and sending the answers from each as context to the initial query). In most cases this is the only “right” thing to do…

If not, then:

1. When the context can be further broken down into smaller pieces, I definitely do that, targeting ideally some 200 tokens per chunk.

2. If it is not really possible to break the context into parts, I try to reduce it by removing “text style” (recursively reformulating the long text phrases into facts-only short sentences); often this gives really good results.

3. If #2 doesn't work, I run it again through a different model to summarize the facts-only text into shorter text, or even an outline only. I'm losing precision here, so I need to tell that to the answering model so it mentions it to the user in the reply and they can reformulate the question (forcing the user to focus on details so that simpler questions can be answered).

4. Plan to have a knowledge-base archive with long articles accessible via a link given as an answer.

In any case, when a short fact (yes, the game is to have the facts as short as possible) is selected by cosine similarity, I find its exact position in the original article and have my “magic” function return the text surrounding the fact to fill the available tokens, in order not to miss any details.
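
A minimal sketch of such a function, assuming the selected fact appears verbatim in the original article and approximating token counts as characters divided by 4:

# Sketch: return the text surrounding a matched fact, sized to a token budget.
def surrounding_context(article: str, fact: str, budget_tokens: int) -> str:
    pos = article.find(fact)
    if pos == -1:
        return fact  # fall back to the fact itself if not found verbatim
    budget_chars = budget_tokens * 4  # rough chars-per-token estimate
    spare = max(budget_chars - len(fact), 0) // 2
    start = max(pos - spare, 0)
    end = min(pos + len(fact) + spare, len(article))
    return article[start:end]

article_text = "..."    # full original article (placeholder)
retrieved_fact = "..."  # short fact selected by cosine similarity (placeholder)
context = surrounding_context(article_text, retrieved_fact, budget_tokens=400)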

Hope this helps.


Here is an example from the travel industry of how the “text style” can be removed to get facts only:

I used that to train a fine-tuned model to reduce the text to facts only.

@sergeliatko You helped a lot. In order to shorten the facts, I ran every paragraph through davinci-003 asking for summarization and stored the results. I didn't even use cosine similarity so far; I just ordered them using common keywords and picked the top 5 candidates. Then I built the prompts as you suggested. It's far from “on point” yet, but it seems like the right direction.
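
Roughly, the per-paragraph summarization pass looks like this (a minimal sketch with the pre-1.0 openai package and text-davinci-003; the instruction wording is an illustration, not the exact prompt used):

# Sketch: reduce a verbose paragraph to a short, facts-only version.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def facts_only(paragraph: str) -> str:
    prompt = (
        "Rewrite the text below as short, facts-only sentences. "
        "Remove the narrative style; keep every number, menu name and code.\n\n"
        f"Text:\n{paragraph}\n\nFacts only:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,
        max_tokens=300,
    )
    return resp["choices"][0]["text"].strip()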

Curious what your unprocessed “fact” looks like.

And here you're about to explode your operating costs…


A typical fact paragraph looks like this (originally in Portuguese):

How should the parameterization be set up to calculate the ICMS advance payment for a Simples Nacional company (Law No. 18,241/2021)?
This solution is exclusive to companies in the state of Santa Catarina. For the calculation to happen, the company must be configured beforehand for Simples Nacional and with the taxes ‘1 – ICMS’ and ‘27 – ICMSA’ in the Parameters. If you have questions about how to configure the company, see the Related Solution. ACCUMULATOR: 1 – Go to the ARQUIVOS menu, click ACUMULADORES and locate or create an accumulator for an ENTRADAS (incoming) operation; 2 – On the IMPOSTOS tab, enter the taxes ‘1 – ICMS’ and ‘27 – ICMSA’, click the ‘Definição […]’ button and enter the respective definitions; 3 – Click the [Gravar] button to finish. ENTRY: 1 – Go to the MOVIMENTOS menu, click ENTRADAS and make the entry with the configured accumulator; NOTE: for the calculation, the entry must have a CFOP starting with 2.XXX. 2 – After filling in the values, check the tax line and the respective tax values. Example: the ICMSA value was R$ 1,363.64; to arrive at this value, the system performed the following calculation: NOTE: to arrive at the ‘Alíq. Interna’ value, the system divides the ICMSA rate by 100. E.g., a rate of 12%: 12/100 = 0.12.

A smaller and much more helpful version would be

Simples Nacional companies in the state of Santa Catarina must be configured beforehand with taxes ‘1 - ICMS’ and ‘27 - ICMSA’ in the parameters. In the ARQUIVOS menu, create an accumulator for the ENTRADAS operation, entering the taxes and definitions. In the MOVIMENTOS menu, make the entry with the configured accumulator, CFOP starting with 2.XXX. The ICMSA value is calculated by dividing the ICMSA rate by 100.

The latter was generated by davinci. Unfortunately, I can't use a human expert to summarize them. There are thousands of facts just like that one in our db. You were right about costs, though. I've just spent 50 USD of my company's budget on only a couple hundred tests =)