What to do when fine-tuning is not working?

I just started a new experiment to build a chatbot around general information about a tech conference: schedule, talks, speakers, etc. This is a chatbot for a very specific domain that was not available during GPT-3 training and is mainly factual.

I was able to write some prompts and build my first fine-tuned model, but after some tests I am realising it's not working at all as I expected.

I would like to know what different approaches more experienced users would recommend I try.


I just did this cover letter chatbot yesterday without fine-tuning.

1 Like

Excellent! I was more after potential solutions to failing fine-tuning experiments but this is also an interesting take where you gather the knowledge base as you go and do a final transformation to get what you want. I could potentially add all my facts as a header but then it would be quite restrictive and heavy per request. I guess I could use this as a last resort if every other fine-tuning approach fails.

1 Like

I learned through experimentation that fine-tuning does not teach GPT-3 a knowledge base. The consensus approach for Q&A, which various people are using, is to embed your text in chunks (done once in advance), and then on the fly (1) embed the query, (2) compare the query to your chunks, (3) get the best n chunks in terms of semantic similarity, (4) send the query and those chunks to the completions endpoint (text-davinci-002) along with instructions like "please answer the question based on this information."
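Step (3), the ranking, is just cosine similarity over vectors. A minimal sketch with toy 2-D vectors standing in for real embeddings (in practice the vectors come from the embeddings endpoint, once for the chunks and once per query; the chunk texts here are made up):

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_chunks(query_emb, chunk_embs, chunks, n=3):
    # Rank every chunk by similarity to the query, keep the best n.
    scored = sorted(
        zip(chunks, chunk_embs),
        key=lambda pair: cosine_similarity(query_emb, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:n]]

# Toy 2-D vectors; real embeddings have hundreds of dimensions.
chunks = ["talk schedule", "speaker bios", "venue parking"]
chunk_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_emb = [0.9, 0.1]  # e.g. the embedding of "when is the keynote?"

best = top_n_chunks(query_emb, chunk_embs, chunks, n=2)
# `best` plus the user's question then go into the completion prompt (step 4).
```

The same top-n selection works unchanged whatever embedding model produced the vectors, since only relative similarity matters.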

It would be great if there was a dedicated question-answering endpoint or “content-limited” endpoint. I don’t think fine-tuning really provides meaningful new topical knowledge to GPT-3, instead it mostly tells GPT-3 the desired style and format of the completion (@daveshapautomator do you agree?) Most importantly, fine-tuning doesn’t automatically restrict GPT-3 to answering from a certain knowledge base.


Sorry for hijacking this post, but…

I’m doing a variation of this in my QA bot, but I think I’m missing some key parts because my queries are still pretty big, especially when it needs to query a document.
Can you ELI5 this for me, aka show some examples?
I would greatly appreciate it!

1 Like

Do you mean the user's query is big, or your prompt (which includes the user's query) is big? When you say "especially when it needs to query a document", what exactly do you mean, i.e. what else besides a document would you be querying? It could be that you need to re-read the documentation to learn how to set up your text for embeddings-based search followed by completions. There are also several videos, including one by @daveshapautomator, that I believe several people have used.

Thanks for the answer @lmccallum

Currently I’m not using embeddings or whatever it’s called (do I need those?)

What I am doing currently with my QA bot is basically (simplified):

  1. User asks bot: "How is the xyz fee processed?"
  2. Bot takes the whole conversation (every time) and classifies it with davinci (to get good context), with a prompt like:

     Classify the following into these 3 categories: GREETINGS, XYZ, OTHERS
     2. Wanna ask about xyz <XYZ>
     3. Hows the fee processed? <XYZ>
     ... about 15 examples of context ...

     Not classified:
     1. ... Here I paste the last 15 lines of the chat for it to classify. I just need the last message, but the more messages, the better it can detect context ...
     1. ... Here it begins autocompleting and numbering up, works well enough ...

  3. If it classified the LAST chat message as <XYZ>, for example, it creates a new prompt (loading the pasted document from an XYZ.txt file) like:

     At xyz, we process the fee like xyz. Also we do x, y and z. Our service... etc etc giant document, takes ***2.5K*** tokens.

     Based on the document above, answer the following question:

     A conversation between a client and Jane. Etc etc setting the tone. ...This is a header snippet file, used throughout the project...

     ... Here I paste the last 5 chat messages for context ...
     Jane: ...Here the bot autocompletes the answer, works extremely well...

  4. Get the answer from the last Jane line above and return it to the user. Since it's given the tone in step 3, together with the document to answer from, it maintains the conversation's tone pretty well and fits in perfectly.

This works pretty well and allows me to query multiple documents, depending on what's being talked about, but consumes an abysmal amount of tokens. Any help is greatly appreciated.
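The two-pass flow above boils down to assembling two prompt strings. A minimal sketch of the prompt builders, assuming made-up function names and tiny stand-in data (the real classifier prompt would carry the ~15 few-shot examples, and the document/header would come from files):

```python
def build_classifier_prompt(examples, recent_messages):
    # Pass 1: few-shot classification of the latest chat messages.
    lines = ["Classify the following into these 3 categories: GREETINGS, XYZ, OTHERS"]
    lines += [f"{i}. {text} <{label}>" for i, (text, label) in enumerate(examples, start=1)]
    lines += ["", "Not classified:"]
    lines += [f"{i}. {msg}" for i, msg in enumerate(recent_messages, start=1)]
    return "\n".join(lines)

def build_answer_prompt(document, header, recent_chat):
    # Pass 2: document + instruction + tone-setting header + recent chat,
    # ending with "Jane:" so the completion is her next reply.
    return (
        f"{document}\n\n"
        "Based on the document above, answer the following question:\n\n"
        f"{header}\n\n"
        f"{recent_chat}\nJane:"
    )

classifier = build_classifier_prompt(
    [("Wanna ask about xyz", "XYZ"), ("Hows the fee processed?", "XYZ")],
    ["Hello!", "How is the xyz fee processed?"],
)
answer = build_answer_prompt(
    "At xyz, we process the fee like xyz.",       # would be loaded from XYZ.txt
    "A conversation between a client and Jane.",  # shared header snippet
    "Client: How is the xyz fee processed?",
)
```

Each builder's output is what gets sent to davinci in that pass; the token cost comes almost entirely from the document pasted into the second prompt.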

And yes, I've watched several of David's videos; they're awesome and inspired this architecture.

How the files really look

Of course my files look like:

{{ doc }}
Based on the document above, answer following question:

{{ header }}

{{ chat }}

But to simplify things, I’ve commented everything in the post. Also, the prompts are not exactly in the same words as the ones I use. I won’t open all of them now just for the sake of demonstration ahaha What remains the same, however, is the architecture and the structure in which I did it.

Hi, I am not very familiar with the chat bot use case, but I think you’d want to (1) in advance, divide your documents into smaller pieces and obtain their embeddings, (2) if the last chat message is about XYZ, then get the embedding for that message (the query), (3) compare the query embedding to the document embeddings using cosine similarity to find the top n semantically similar matches, (4) use the top n matches, instead of the whole document, as part of your next prompt in the chat bot.
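Step (1), splitting documents into smaller pieces before embedding, can be as simple as a sliding word window. A sketch, assuming made-up sizes (chunk length in words, with some overlap so context isn't cut dead at chunk borders):

```python
def chunk_words(text, size=200, overlap=50):
    # Split text into word windows of `size` words, stepping by
    # size - overlap so adjacent chunks share `overlap` words.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

# Example: a 450-word document yields chunks starting at words 0, 150, 300.
doc = " ".join(f"word{i}" for i in range(450))
pieces = chunk_words(doc, size=200, overlap=50)
```

Each piece is embedded once up front; at chat time only the query embedding is computed, so the per-message cost stays small.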


Thanks for taking your time to help me!

Can you guide me in the right direction with this? Is there any video, resource or example that I could look at?

Should I give up trying to train davinci on new factual information or specific domain knowledge, and use the approach mentioned in this thread of bringing the facts manually into the prompt in a summary-overview style? Doesn't that mean that fine-tuning is actually not working for extending domain-specific information?

These resources should help:

1 Like

Yes I think so. Hopefully others will weigh in.

Based on my experiments, I agree with @lmccallum. I found fine-tuning to work well in encoding/internalizing few-shot learning into a fine-tuned model. I haven’t tried it yet, but I suspect it would also do well in lending a “voice” to the model.

Because of some of the posts showing impressive results fine-tuning with a relatively small dataset, I figured it was worth a shot to try fine-tuning to encode/internalize a context. But, to be honest, I wasn't really expecting it to work and, to be fair, I think it makes sense for it not to: fine-tuning probably works by tweaking the outer layers of the model, but to encode/internalize context the way the stock model does, it stands to reason that you'd need a very large dataset and to train the whole model, not just fine-tune the outer layers.

My mini-dilemma now is figuring out whether it’s worth optimizing the dynamic prompt generation process if GPT-4 ends up being released soon, rendering all of that obsolete :slight_smile:

Not to hijack the thread, but I have a similar use case and was wondering if anyone has any insights that may be relevant.

My chatbot speaks to the user, asks a series of questions, and is then supposed to output a summary of the conversation that details whether certain topics were covered. The general flow of a conversation is like this:

Topics: a,b,c
User: message
Bot: what about A?
Bot: what about C?
User: message
Bot: Ok, your summary is: {a,b,c}

When I do this with Prompt Engineering by giving 3 examples of such conversations, GPT-3 reliably steers the conversation towards answering those questions and eventually outputs the required summary data.

However, this input prompt is quite long (2k+ tokens), so I wanted to try fine-tuning instead. After doing so, it no longer seems to work reliably: GPT-3 asks about the correct topics but "forgets" that it has already asked certain questions, and "forgets" that it is supposed to output a summary after asking all the questions. I prepared my fine-tuning data using 15 "correct" conversations (as opposed to the 3 "correct" conversations in my original prompt), so I am a bit surprised to see the performance be this much worse. These 15 conversations represent ~200 prompt/completion pairs that I used for fine-tuning.
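For reference, the fine-tuning file is JSONL with one prompt/completion pair per line. A sketch of turning one conversation into pairs, following OpenAI's data-preparation conventions (a fixed separator marking the end of the prompt, and a completion that starts with a space); the turn data and function name are made up:

```python
import json

SEPARATOR = "\n\n###\n\n"  # fixed marker so the model knows where the prompt ends

def conversation_to_pairs(turns):
    # Each bot turn becomes one training pair whose prompt is the
    # whole conversation up to that point.
    pairs = []
    history = []
    for speaker, text in turns:
        if speaker == "Bot" and history:
            pairs.append({
                "prompt": "\n".join(history) + SEPARATOR,
                "completion": " " + text,  # leading space per OpenAI's guidance
            })
        history.append(f"{speaker}: {text}")
    return pairs

turns = [
    ("User", "message"),
    ("Bot", "what about A?"),
    ("Bot", "what about C?"),
    ("User", "message"),
    ("Bot", "Ok, your summary is: {a,b,c}"),
]
jsonl = "\n".join(json.dumps(p) for p in conversation_to_pairs(turns))
```

One caveat this makes visible: each pair only shows the model one next turn at a time, which may be why a fine-tuned model "forgets" multi-turn obligations that a single long prompt keeps explicit.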

I conclude that there is no clear answer on fine-tuning for extending the model's knowledge. I set up a Davinci model in the Playground and it was working fantastically; the very same model as a fine-tune failed drastically. So my question is: what is the role of fine-tuning at all?


I think the best approach is to bring whatever context is necessary into your prompt (search, overview, session, etc.)… fine-tuning won't bring new knowledge, at least in my tests…

Some people got results by setting the temperature to zero, adding unique IDs to the prompts, and adding more prompts (a combination of all 3 strategies). I am trying that and will let you know; IMO OpenAI's fine-tuning is designed to add knowledge if done properly.

1 Like

My understanding is that fine-tuning is best for avoiding repetitive contextual/customised prompts, and not at all for adding new knowledge.

This makes sense, as the training only happens in the last hidden layers and not across the whole neural network.

Note that once you get an undesired result it may give you a lot of trouble if you can’t deliver reliable outputs to your stakeholders.

What I mentioned works 100%. Fine-tuning for new knowledge has a very low success rate, judging from the feedback in this community and my own experience.


Is there a way to perform “fine tuning” in a playground environment? Such as, you can add a couple items, then prompt to see what it will do? Or do you have to go to the trouble of creating the jsonl file and uploading it every single time you want to test something different?

I am working on a chatbot as well, for a specific domain. My question is: what is your approach to "bring whatever context necessary to your prompt"? Are you using ChatGPT to guide you and filter what's being asked? Or just searching the user's query text for specific keywords? Any detail you can give about your process would be much appreciated…