Fine Tuned Chatbot forgets how to output summary of conversation

Hi there! I’m building a chat bot that asks a series of questions and then outputs a summary of the conversation with a summarized version of the user’s response to each question. The general flow of a conversation is like this:

Questions: A,B,C
User: message
Bot: A?
Bot: C?
User: message
Bot: Ok, your summary is: {a,b,c}

When I do this with Prompt Engineering by giving 3 examples of such conversations, text-davinci-002 reliably steers the conversation towards answering those questions and eventually outputs the required summary data.

However, this input prompt is quite long (2k+ tokens) so I wanted to try fine tuning it. After doing so, it doesn’t seem to reliably work anymore - GPT-3 seems to ask about the correct topics but “forgets” that it has already asked certain questions and “forgets” that it is supposed to output a summary after asking all the questions. I end up with conversations like this:

Questions: A,B,C
User: message
Bot: A?
Bot: A?
User: message
Bot: cool, B?
(note the lack of summary + asking the same question multiple times)

I prepared my fine-tuning data by using 15 “correct” conversations (as opposed to the 3 “correct conversations” in my original prompt), so I am a bit surprised to see the performance be this much worse. These 15 conversations represent ~200 prompt/completion pairs that I used for fine tuning. I am aiming for ~500, but these take some time to compile.

Each prompt/completion pair is formatted like this:
“prompt”:“[questions] + \n + [chat log so far] + \n + Bot:”,
“completion”: “[response from a real text-davinci-002 conversation]”
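For anyone wanting to reproduce this setup, here is a minimal sketch of how one finished conversation could be split into pairs in the format above. The function names (`build_pairs`, `to_jsonl`) and the `(speaker, text)` tuple layout are hypothetical, not what I actually ran. The leading space on the completion and the trailing `\n` used as a stop marker are assumptions based on common fine-tuning conventions, so adjust them to whatever your data uses.

```python
import json


def build_pairs(questions, turns):
    """Split one finished conversation into prompt/completion pairs.

    questions: list of question labels, e.g. ["A", "B", "C"]
    turns: ordered list of (speaker, text) tuples, speaker is "User" or "Bot"
    One pair is produced for every Bot turn.
    """
    header = "Questions: " + ",".join(questions)
    pairs = []
    log = []  # chat log so far, one "Speaker: text" line per turn
    for speaker, text in turns:
        if speaker == "Bot":
            # [questions] + \n + [chat log so far] + \n + Bot:
            prompt = "\n".join([header] + log + ["Bot:"])
            # Assumption: leading space and a trailing "\n" stop marker.
            pairs.append({"prompt": prompt, "completion": " " + text + "\n"})
        log.append(f"{speaker}: {text}")
    return pairs


def to_jsonl(pairs):
    """Serialize the pairs as JSON Lines, one pair per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


if __name__ == "__main__":
    turns = [
        ("User", "message"),
        ("Bot", "A?"),
        ("User", "message"),
        ("Bot", "Ok, your summary is: {a,b,c}"),
    ]
    print(to_jsonl(build_pairs(["A", "B", "C"], turns)))
```

Because every Bot turn becomes its own pair, the summary turn is only one example per conversation, which may be part of why the fine-tuned model rarely produces it.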

The main questions I have are:

  1. Should I start by scaling up my fine-tuning data set or by experimenting with ways to pre-process my data?
  2. How might varying the learning-rate-multiplier and/or num_epochs affect things in this use case? I seem to have gotten slightly better results after dropping the num_epochs, though they are all far from what I get using text-davinci-002. Are there other tweaks I should apply when fine-tuning?

Something similar happened to me. I had to make a chatbot trained on a large body of pharmaceutical text for a company I was building a service for.

What happened was strange: the model seemed to completely forget all of its previous knowledge and replace it with the new material. It generated text that was coherent and on topic for the questions, but organized in a very bad and nonsensical way. I was saying what a **** :rofl:

Then I understood why: when you fine-tune, you do it on the “davinci” model, which is not the current “text-davinci-003”, or 002, or even 001. It’s the base model they used to build those models.

Davinci is the “non-instruction” model, or equivalently the 001 model without the fine-tuning for instruction-following tasks and without the prompt-based supervised learning and optimization. Text Davinci differs from base Davinci because it was specially fine-tuned for instruction following: if you write “answer”, it understands what it means to answer, how many lines to write before stopping, and so on.

What can you do? I’m not a total expert, so I can’t promise it would work, but you can try fine-tuning the model on a large number of prompts so it learns to follow instructions (you would need at least 10k prompt/completion pairs, I think, with highly varied content). After that, the model should be better at instruction following, and then you can fine-tune it on the specific data you need.

You can probably find some instruction-following training data on the internet.

Let me know if you try it and it works, because I wanted to try this too! Thanks


This is the same conclusion I came to - fine tuning just doesn’t work in a situation like this (at least without an unreasonable amount of training data).


How big was your dataset? I believe OpenAI recommends doubling if your first dataset doesn’t work. So, if you had 200 examples (bare minimum), you’d want to try 400+ the next go. For something like this, though, you might well need 2,000+ examples. The more the better…

Got up to around 1k, though I honestly didn’t see a meaningful boost in performance as I scaled up from 200 → 400 → 600 → 800 → 1000

Maybe another thousand is all I’m missing but who knows? Results so far haven’t been TOO promising…


I did one for DND character backstories with just over 1,000 examples, I believe. I’ve got 2,700+ examples now, but I’ve not had the time or money to fine-tune again. And, personally, I find the cheaper Davinci model with two-shot examples performs about as well, maybe a bit better?

How did you set up your prompt/completion pairs in the dataset? Maybe we can improve on that a bit too?


From what I have seen, most people playing around with the Fine Tuning API have the same issues.
The better solution is to use semantic search to find the most relevant context, trim it to fit within the model’s token limit, and use prompt engineering to combine it with the question being answered.
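As a rough illustration of that approach, here is a minimal sketch that ranks passages against a question, keeps as many as fit in a budget, and builds the prompt. Everything in it is an assumption for demonstration: `embed` is a toy bag-of-words stand-in for a real embedding model, `count_tokens` counts whitespace-separated words rather than real model tokens, and the prompt template is just one possible layout.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_tokens(text):
    """Rough proxy for token count: whitespace-separated words."""
    return len(text.split())


def select_context(query, passages, max_tokens):
    """Pick the most similar passages that fit within max_tokens."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    chosen, used = [], 0
    for passage in ranked:
        cost = count_tokens(passage)
        if used + cost > max_tokens:
            continue  # skip passages that would overflow the budget
        chosen.append(passage)
        used += cost
    return chosen


def build_prompt(query, passages, max_tokens=50):
    """Combine the selected context with the question."""
    context = "\n".join(select_context(query, passages, max_tokens))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "aspirin dosage for adults is listed in the label",
        "the weather today is sunny and warm",
        "storage conditions for aspirin tablets",
    ]
    print(build_prompt("aspirin dosage", docs, max_tokens=14))
```

In a real setup you would swap `embed` for an actual embedding model and `count_tokens` for the tokenizer of the model you are prompting.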


I hear you re: time and money for fine tuning not being worth

That’s pretty much the solution I ended up going with for

Agree with the comment Nelson, regarding semantic search,

We’re testing ChatGPT-3 vs ChatGPT-4 (preview) on US FDA content related queries.
We’ve come to the conclusion that a fine tuned OpenAI GPT-x is best trained on FDA content we have collected, i.e. Guidance PDFs and 21 CFRs (Code of Federal Regulations), and geared for semantic search.
