QA fine-tuned chatbot not answering from the trained data but giving nonfactual answers

And here you’re about to blow up your operating costs…

2 Likes

A typical paragraph of fact looks like this (originally in Portuguese):

How should the parameters be configured to calculate the ICMS advance payment for a Simples Nacional company (Law No. 18.241/2021)?
This solution is exclusive to companies in the state of Santa Catarina. For the calculation to happen, the company must be previously configured for Simples Nacional and with the taxes '1 – ICMS' and '27 – ICMSA' in the Parameters. If you have questions about how to configure the company, see the Related Solution. ACCUMULATOR 1 – Open the ARQUIVOS menu, click ACUMULADORES and locate or create an accumulator for an ENTRADAS (incoming) operation; 2 – On the IMPOSTOS tab, enter the taxes '1 – ICMS' and '27 – ICMSA', click the 'Definição […]' button and enter the respective definitions; 3 – Click the [Gravar] button to finish. ENTRY 1 – Open the MOVIMENTOS menu, click ENTRADAS and make the entry with the configured accumulator; NOTE: For the calculation, the entry must have a CFOP starting with 2.XXX. 2 – After filling in the values, check the tax lines and their respective values. Example: The ICMSA value was R$ 1,363.64; to arrive at this value, the system performed the following calculation: NOTE: To arrive at the 'Alíq. Interna' value, the system divides the ICMSA rate by 100. E.g.: a rate of 12% – thus: 12/100 = 0.12.

A smaller and much more helpful version would be

Simples Nacional companies in the state of Santa Catarina must be previously configured with the taxes '1 - ICMS' and '27 - ICMSA' in the parameters. In the ARQUIVOS menu, create an accumulator for the ENTRADAS operation, entering the taxes and their definitions. In the MOVIMENTOS menu, make the entry with the configured accumulator, with a CFOP starting with 2.XXX. The ICMSA value is calculated by dividing the ICMSA rate by 100.

The latter was generated by davinci. Unfortunately, I can’t use a human expert to summarize it. There are thousands of facts just like that one in our DB. You were right about costs, though. I’ve just spent 50 USD of my company’s budget on only a couple hundred tests =)
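For reference, a minimal sketch of how such a davinci summary can be produced with the Completions API of that era (the model name, prompt wording and file name here are assumptions, not my exact call):

import openai  # 0.x-era client

openai.api_key = "YOUR_API_KEY"

# fact.txt holds one verbose support article like the one above (hypothetical file)
fact = open("fact.txt", encoding="utf-8").read()

prompt = (
    "Summarize the support article below into a short paragraph that keeps "
    "only the essential configuration steps and values:\n\n"
    f'"""\n{fact}\n"""\n\nSummary:'
)

response = openai.Completion.create(
    model="text-davinci-003",  # assumed; any davinci-class model
    prompt=prompt,
    max_tokens=300,
    temperature=0.2,
)
print(response["choices"][0]["text"].strip())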

I see… You need to cut those into smaller pieces (the one you gave here would be at least 6 pieces, cut by meaning), then embed the pieces one by one (get a vector) and use something like Weaviate to store them as vectors. And query not only the text of the paragraph but also the titles of its parent elements (you need to come up with a really good schema for the Weaviate DB to classify those).
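To make the idea concrete, here is a minimal sketch of that pipeline, assuming the 0.x openai library and ada-002 embeddings. A plain in-memory list stands in for the Weaviate class (in Weaviate you would give the class something like "text" and "parentTitle" properties); the piece texts are just illustrations:

import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"

def embed(text):
    # ada-002 was the usual embedding model at the time
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# Each piece keeps its parent title so the vector carries both the local
# meaning and the context it belongs to.
pieces = [
    {"parent": "ICMS advance for Simples Nacional (SC)",
     "text": "Company must be configured for Simples Nacional with taxes 1 - ICMS and 27 - ICMSA."},
    {"parent": "ICMS advance for Simples Nacional (SC)",
     "text": "In ARQUIVOS > ACUMULADORES, create an accumulator for ENTRADAS and set both taxes."},
    {"parent": "ICMS advance for Simples Nacional (SC)",
     "text": "The ICMSA internal rate is the ICMSA percentage divided by 100."},
]

# In production these vectors plus properties would go into Weaviate;
# here a plain list stands in for the vector store.
for p in pieces:
    p["vector"] = embed(f'{p["parent"]}\n{p["text"]}')

def search(query, top_k=2):
    q = embed(query)
    def cosine(p):
        return float(np.dot(q, p["vector"]) / (np.linalg.norm(q) * np.linalg.norm(p["vector"])))
    return sorted(pieces, key=cosine, reverse=True)[:top_k]

for hit in search("how do I calculate icms for accumulation?"):
    print(hit["text"])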

I use a fine-tuned model to format this type of paragraph (sections, I would say) and cut them into pieces by meaning. It’s a bit expensive to train (you start with davinci), but it gets cheaper because you can use results from davinci to train Curie.
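The fine-tune data for such a formatter is just prompt/completion pairs where the prompt is the raw glued-together section and the completion is the cleaned layout (title on the first line, one meaning unit per line). A hypothetical training line, with the separator and wording being my own choices, not my production data:

{"prompt": "Raw section text as it sits in the database, title and body glued together...\n\n###\n\n", "completion": " ICMS advance for Simples Nacional (SC)\nApplies to Simples Nacional companies in Santa Catarina.\nConfigure taxes 1 - ICMS and 27 - ICMSA in the parameters.\nCreate an accumulator for ENTRADAS and set both taxes.\nMake the entry with the configured accumulator and a CFOP starting with 2.XXX.\nThe ICMSA internal rate is the ICMSA percentage divided by 100.<|endoftext|>"}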

This approach will (in my personal opinion) be the cheapest and most performant in the long run, especially if you have a lot of docs to store.

1 Like

Here is an example of how I’ve built such a formatter (sorry, it’s in English as I don’t write Portuguese; however, I can understand the subject and some details of what’s written, much as with Spanish and French).

From the prompt you can clearly see the 6 sections this text can be split into:

Title, applicable companies, prerequisites, instructions for people in doubt, accumulator instructions, entry instructions.

As you see, the title is separated from the body, and each paragraph (with standalone meaning) within the body is on its own separate line. Easy to parse with a simple script.
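Something along these lines is enough to parse it (a sketch, assuming exactly that layout: title on the first line, one standalone piece per following line):

def parse_section(formatted: str):
    # Title is the first non-empty line; every later non-empty line is one piece.
    lines = [l.strip() for l in formatted.splitlines() if l.strip()]
    return {"title": lines[0], "pieces": lines[1:]}

example = """ICMS advance for Simples Nacional (SC)
Applies to Simples Nacional companies in Santa Catarina.
Create an accumulator for ENTRADAS with taxes 1 - ICMS and 27 - ICMSA.
The ICMSA internal rate is the ICMSA percentage divided by 100."""

print(parse_section(example)["pieces"])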

When you embed each section separately, make sure the title is passed as a parent name in your entities schema, so that you can search like:

how do I calculate icms for accumulation?

The passage you gave was reduced to:

This solution is exclusive for companies in the state of Santa Catarina. To calculate, the company must be configured for Simples Nacional with taxes ‘1 - ICMS’ and ‘27 - ICMSA’. An accumulator must be created for the DOWNPUT operation, and taxes must be noted and aligned in the INCOMES entry. The ICMSA rate is divided by 100 to reach the value ‘Alíq. Interna’.

Hi Rafael,

We are working on an idea that may work for you. We are using it for querying academic research papers, where we need several contexts from different parts of (sometimes several) documents to provide the final answer (the same applies to clause references in law).

We are also doing this because our field has text in English, German, Portuguese, and French.

Instead of handling all three contexts in one go, we ask the question with the first context, and then we call the API again with the answer we receive from the first query along with the next context in the list. The second prompt asks the API to improve the original answer by considering the new context.

We repeat this process multiple times (you may only need two repeats) until we get the final answer. We have also used a final prompt to rewrite the combined text in a specific style of writing for academia.

This is working really well for us. It allows us to handle multiple source document languages. It also allows us to consider larger parts of the document without having to worry about breakpoints too much. (We are also doing some dicing and slicing of the source documents before we start)
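In code the loop is very simple. A sketch of the idea (generic prompt wording and model, not our exact prompts; those come further down the thread):

import openai

openai.api_key = "YOUR_API_KEY"

def complete(prompt):
    resp = openai.Completion.create(
        model="text-davinci-003",  # assumed; any instruct-style model
        prompt=prompt, max_tokens=400, temperature=0.2)
    return resp["choices"][0]["text"].strip()

def answer_with_refinement(question, contexts):
    # First pass: answer from the first context only.
    answer = complete(f"Context:\n{contexts[0]}\n\nQuestion: {question}\nAnswer:")
    # Each later pass: improve the previous answer using the next context.
    for ctx in contexts[1:]:
        answer = complete(
            "Improve the answer below by considering the additional context. "
            "Keep everything that is still correct.\n\n"
            f"Question: {question}\n\nPrevious answer:\n{answer}\n\n"
            f"Additional context:\n{ctx}\n\nImproved answer:")
    return answer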

I have sent you a private message on chat if you want a more detailed answer.

3 Likes

Love this approach, definitely will try. Thanks.

1 Like

You can also stuff two or three contexts into the first query. It doesn’t have to be 1-by-1. It depends on your token limits in the prompt part.

You may not be able to do it in Rafael’s case, but we are also investigating getting the first calls to generate bullet lists for the answer, which we improve/add to over the 2nd and 3rd contexts. The final call gets GPT to convert the bullet list back into text.

This way you can stuff a lot more knowledge into your final prompt and it looks like it doesn’t dilute the result too much (Can’t verify this right now though)

1 Like

Personally, I failed with pure bullet points, as they sometimes lose the causal conjunctions when you expand them. That’s why I had to create the “style remover”, as it keeps those elements in (needed in my application).

Would love to see your results with bullet points.

I’ll get you some sample input and output in the next day or two. I’m interested to know if you felt it lost context when making the bullet points, or when putting it all back together again?

The use case was copywriting, where it was seen the most. Contracting text to bullet points and then expanding it back to (approximately) normal text often omitted the causal relation between the bullet points (they became a sequence of events rather than a list of consequences).

In legal usage it was something that could be ignored in most cases, but I judged using it across the board too risky.

Maybe it was me who could not create good examples in English (my 3rd language out of 6), but, coming from linguistics originally, I suspect this is something that, by definition, cannot be fixed.

2 Likes

Hi Raymond, that was clever!

You get the benefit of using as many facts as you need, working around the token limits, but keeping the costs almost the same, as the total number of tokens won’t change much. I’ll definitely try that one in my case and then report the results back to you.

Could you tell us which specific instructions you send to the bot to make it improve the previous answer? What does your prompt look like? And what models are you using? Have you considered mixing them in the process?

Tokens per request do not change… But the number of requests goes up (sometimes not linearly)… So you can’t get away from increasing costs. However, this approach slows down the otherwise exponentially growing costs.
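A rough back-of-the-envelope with made-up numbers shows the trade-off (each call stays bounded, but the total spend grows with the number of passes):

# Made-up numbers: 5 contexts of ~800 tokens, answers of ~300 tokens,
# question + instructions ~100 tokens.
contexts, ctx_tokens, ans_tokens, overhead = 5, 800, 300, 100

# Everything stuffed into one prompt (may not even fit the model's limit).
stuffed = contexts * ctx_tokens + overhead + ans_tokens      # ~4400 tokens in one call

# Chained refinement: each call sees one context plus the previous answer.
per_call = ctx_tokens + ans_tokens + overhead + ans_tokens   # ~1500 tokens per call
chained_total = contexts * per_call                          # ~7500 tokens over 5 calls

print(stuffed, per_call, chained_total)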

I would also note that this approach is close to solving the major underlying problem of “operational memory focus” size, and it is definitely worth pursuing, as it is very close to how humans think in similar situations (updating the conclusion as new context arrives).

@raymonddavey have you guys tried sending asynchronous requests for each processable batch of contexts separately and then combining them into a “master request” containing the following sequence (for each previous request):

  1. Top level bullet points of contexts used
  2. Items considered to form the answer
  3. Answer based on #1 and #2

Then followed by:

User query
Master prompt instructions
Final answer based on the reflection above: (model answer goes here)

?

Would be interesting to test, especially for performance, as you don’t need to wait for the chaining of the reflection process.
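A sketch of what I mean, using asyncio around the blocking client so the per-batch reflections run in parallel, followed by one master call (the prompt wording and helper names are mine):

import asyncio
import openai

openai.api_key = "YOUR_API_KEY"

def complete(prompt):
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=400, temperature=0.2)
    return resp["choices"][0]["text"].strip()

async def reflect(question, batch):
    prompt = (f"Context:\n{batch}\n\nQuestion: {question}\n"
              "Give: 1) top-level bullet points of the context used, "
              "2) the items considered to form the answer, "
              "3) an answer based on #1 and #2.\n")
    # Run the blocking call in a worker thread so the batches run concurrently.
    return await asyncio.to_thread(complete, prompt)

async def master_answer(question, batches):
    reflections = await asyncio.gather(*[reflect(question, b) for b in batches])
    master = ("\n\n".join(reflections)
              + f"\n\nUser query: {question}\n"
              + "Master prompt instructions: combine the reflections above.\n"
              + "Final answer based on the reflection above:")
    return complete(master)

# asyncio.run(master_answer("how do I calculate ICMS?", ["batch 1 ...", "batch 2 ..."]))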

1 Like

This will be deployed with a live client today and we should get fairly quick feedback. (Which I will post)

We are using streaming in a front end, so we allow the user to select the number of passes (or contexts) they want to use. It is a parameter they set once.

We have an initial prompt, a followup prompt, and a completion prompt. The middle prompt will be called n-1 times

We use the initial prompt and then give them a button to enhance the answer. If they click the button, it runs the middle part multiple times and then does a final cleanup query

We stream the output so they can see the answer being built. Multiple passes can take some time to complete. We show “pass x of y” at the top of the stream results until the final cleanup call.

It is fun to watch because you see it write something, then start over and produce an enhanced version, etc.

It certainly burns through tokens. But the benefit to the client was established through manual testing before writing the code.

In our case: This client pays for usage based on tokens plus a small margin

Big problem we faced: Token counts are not passed when you use streaming.
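One possible workaround (a sketch, not what we have shipped) is to count the tokens locally with tiktoken, on the prompt you sent and on the text you accumulate from the stream; the model name below is an assumption:

import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")  # pick your actual model

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

full_prompt = "Re-write the text below so that it reads as a blog article..."
streamed_text = "".join(["chunk one ", "chunk two"])  # accumulate the streamed chunks here

print(count_tokens(full_prompt) + count_tokens(streamed_text))  # approximate billable tokens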

I’ll post the prompts we used in the next post. I need to sanitise the prompts slightly to protect the client’s identity.

2 Likes

I need to sanitize the prompts slightly to protect the IP of the client. Our prompts are slightly different - but you will get the idea.

Cut down and (slightly) sanitized prompt 1

Based on the text below write text on the topic described in the prompt. Mention evidence only if they are related to the topic described in the prompt. Avoid repetitions. Do not end with a conclusion or summary sentence.

"""
{context}
"""
Prompt: {searchterm}

Cut down and (slightly) sanitized prompt 2

Re-write CONTEXT 1 by including any new arguments from CONTEXT 2 on the topic described in the prompt. Mention evidence only if they are related to the topic described in the prompt. Avoid repetitions. Do not end with a conclusion or summary sentence.

"""
CONTEXT 1:
{previous}

CONTEXT 2:
{context}
"""
Prompt: {searchterm}

Final prompt (heavily sanitized)

Re-write the text below so that it reads as a blog article written by a New York Times reporter.

"""
{previous}
"""

We replace {previous} with the text derived from the previous step
We replace {searchterm} with the original question (we are q/a based)

We replace {context} with a single context within the claim. However, we often include 2 or 3 contexts in the first step (as many as will fit in the limit). This will often give us a good enough answer, and we don’t need to spend the time and money to enhance - hence the reason the enhancement step is on a button.
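For what it’s worth, here is roughly how the placeholders get filled and chained (a simplified sketch; the prompt texts are shortened, and "complete" stands for any function that sends a prompt to the model, like the ones sketched earlier in the thread):

PROMPT_1 = ('Based on the text below write text on the topic described in the prompt. '
            'Avoid repetitions. Do not end with a conclusion or summary sentence.\n\n'
            '"""\n{context}\n"""\nPrompt: {searchterm}')

PROMPT_2 = ('Re-write CONTEXT 1 by including any new arguments from CONTEXT 2 on the topic '
            'described in the prompt. Avoid repetitions. Do not end with a conclusion or '
            'summary sentence.\n\n'
            '"""\nCONTEXT 1:\n{previous}\n\nCONTEXT 2:\n{context}\n"""\nPrompt: {searchterm}')

FINAL = ('Re-write the text below so that it reads as a blog article written by a '
         'New York Times reporter.\n\n"""\n{previous}\n"""')

def run_chain(complete, searchterm, contexts, first_step_size=3):
    # Step 1: stuff as many contexts as fit (often 2 or 3) into the first prompt.
    previous = complete(PROMPT_1.format(
        context="\n\n".join(contexts[:first_step_size]), searchterm=searchterm))
    # Middle steps: fold one more context into the previous answer each time.
    for ctx in contexts[first_step_size:]:
        previous = complete(PROMPT_2.format(
            previous=previous, context=ctx, searchterm=searchterm))
    # Final step: rewrite the combined text in the target style.
    return complete(FINAL.format(previous=previous))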

We are using AJAX calls in the JavaScript frontend with progress tracking to get the streamed reply (a C# backend is driving this). When one streamed reply ends, we start the next one in the chain. We are still deciding if we will let the user decide to continue at the start of each step. At the moment it is automated - but you can’t stop it once it starts.

We are not sure what will happen when the OpenAI API fails midstream. We are finding that rate limits and errors are causing us issues - but that is another topic and unrelated. I mention it because it will be a problem in the middle of a long (expensive) chain if it has to be rerun.

3 Likes

We have another set of prompts that we are going to try that uses bullet points for steps 1 to n. The final prompt will take the bullet points and try to rebuild the final text.

We know the prompts above work - so we will experiment a bit more

1 Like

Need some time to digest. Very useful. Thanks for sharing.

1 Like

Hey, just to report back: I tried that solution (repeatedly asking to improve the final answer, providing 1 fact at a time) and, in my particular case, I got results slightly better than providing the top 3 facts at once… However, as I mentioned, the response time was a blow for me. Streaming to the user was nice though, as they can see the answer developing. Interesting to watch, but I’m not sure it is worth the extra time and money.

Hi sergeliatko, I read through your replies and I’m not quite sure about this step:

  1. create a script to reformat your saved answers into a seed.jsonl file with the following format for each line:
    {"prompt": "Bot description… Bla bla bla.\nFactual context: fact 1 the most relevant… Fact x the least relevant out of acceptable.\nUser current state: their mood\nConversation summary: bla bla bla…bla\nUser: What do you think of the company X in today's situation?\nBot:", "completion": " the saved reply.<|endoftext|>"}

In the "completion", do we need to separate the same question with different answers from step 6? Let’s say we have 50 questions with 2 answers each. Does that mean we will have 100 training samples? Or do we just include all the answers to the same question, so that the number of training samples is exactly the same as the number of questions?

Thank you anyway for sharing your guidance on combining embeddings + fine-tuning the models.

Yes. It would be 2 prompt/completion lines, each containing one completion. Sort of giving the model an option to choose between the 2 best answers.
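So 50 questions with 2 saved answers each gives 100 training lines. A made-up skeleton of two such lines (same prompt, one completion per line):

{"prompt": "Bot description...\nFactual context: ...\nUser: What do you think of the company X in today's situation?\nBot:", "completion": " First saved reply.<|endoftext|>"}
{"prompt": "Bot description...\nFactual context: ...\nUser: What do you think of the company X in today's situation?\nBot:", "completion": " Second saved reply.<|endoftext|>"}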

1 Like