What to do when fine-tuning is not working?

I just started a new experiment: building a chatbot around general information about a tech conference (schedule, talks, speakers, etc.). This is a chatbot for a very specific, mainly factual domain that was not available during GPT-3 training.

I was able to create some prompts and build my first fine-tuned model, but after some tests I am realising it’s not working at all as I expected.

I would like to know what approaches more experienced users would recommend I try.


I just built this cover letter chatbot yesterday without fine-tuning.


Excellent! I was more after potential solutions to failing fine-tuning experiments, but this is also an interesting take: you gather the knowledge base as you go and do a final transformation to get what you want. I could potentially add all my facts as a header, but that would be quite restrictive and heavy per request. I guess I could use this as a last resort if every other fine-tuning approach fails.


I learned through experimentation that fine-tuning does not teach GPT-3 a knowledge base. The consensus approach for Q&A, which various people are using, is to embed your text in chunks (done once in advance), and then on the fly (1) embed the query, (2) compare the query embedding to your chunks, (3) get the best n chunks in terms of semantic similarity, and (4) send the query and those chunks to the text-davinci-002 endpoint along with instructions like “please answer the question based on this information.”
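The retrieval steps above can be sketched in a few lines. This is a minimal illustration, not a full implementation: it assumes the chunk embeddings were already computed in advance by an embeddings API, so plain lists of floats stand in for the real vectors, and the function names are mine, not from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_n_chunks(query_embedding, chunk_embeddings, chunks, n=3):
    """Rank pre-embedded text chunks by similarity to the query embedding."""
    scored = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:n]]

def build_prompt(query, context_chunks):
    """Assemble the completion prompt from the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        f"{context}\n\n"
        "Please answer the question based on this information.\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The resulting prompt string is what you would send to the completions endpoint; only the top-n chunks travel with each request, instead of the whole knowledge base.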

It would be great if there were a dedicated question-answering endpoint or “content-limited” endpoint. I don’t think fine-tuning really provides meaningful new topical knowledge to GPT-3; instead, it mostly tells GPT-3 the desired style and format of the completion (@daveshapautomator, do you agree?). Most importantly, fine-tuning doesn’t automatically restrict GPT-3 to answering from a certain knowledge base.


Sorry for hijacking this post, but…

I’m doing a variation of this in my QA bot, but I think I’m missing some key parts, because my queries are still pretty big, especially when the bot needs to query a document.
Can you ELI5 this for me, i.e. show some examples?
I would greatly appreciate it!

Do you mean the user’s query is big, or your prompt (which includes the user’s query) is big? When you say “especially when it needs to query a document”, what exactly do you mean, i.e. what else besides a document would you be querying? It could be that you need to re-read the documentation to learn how to set up your text for embeddings-based search followed by completions. There are also several videos, including one by @daveshapautomator, that I believe several people have used.

Thanks for the answer @lmccallum

Currently I’m not using embeddings or whatever it’s called (do I need those?)

What I am doing currently with my QA bot is basically (simplified):

  1. User asks the bot: “How is the xyz fee processed?”
  2. Bot takes the whole conversation (every time) and classifies it with davinci (to get good context), with a prompt like:

Classify the following into these 3 categories: GREETINGS, XYZ, OTHERS
2. Wanna ask about xyz <XYZ>
3. Hows the fee processed? <XYZ>
... about 15 examples of context ...

Not classified:
1. ... Here I paste the last 15 lines of the chat for it to classify. I only need the last message, but the more messages, the better it can detect context ...
... Here it begins autocompleting and numbering up; works well enough ...

  3. If it classified the LAST chat message as <XYZ>, for example, it creates a new prompt (loading the pasted document from an XYZ.txt file) like:

At xyz, we process the fee like xyz. Also we do x, y and z. Our service... etc etc. A giant document; it takes ***2.5K*** tokens.

Based on the document above, answer the following question:

A conversation between a client and Jane. Etc etc setting the tone. ...This is a header snippet file, used throughout the project...

... Here I paste the last 5 chat messages for context ...
Jane: ...Here the bot autocompletes the answer; works extremely well...

  4. Get the answer from the last “Jane:” line above and return it to the user. Since it’s given the tone in step 3, together with the document to answer from, it maintains the conversation’s tone pretty well and fits in perfectly.

This works pretty well and allows me to query multiple documents, depending on what’s being talked about, but it consumes an abysmal amount of tokens. Any help is greatly appreciated.
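For concreteness, here is the classify-then-answer flow above as a short sketch. Everything in it is hypothetical: `complete` stands in for a call to the completions endpoint (stubbed here so the routing logic can be shown end to end), `load_doc` stands in for reading the XYZ.txt-style files, and the prompt wording only paraphrases the post.

```python
def classify(chat_lines, complete):
    """Ask the model to label the last chat message (GREETINGS, XYZ, OTHERS)."""
    examples = "1. Wanna ask about xyz <XYZ>\n2. Hows the fee processed? <XYZ>"
    prompt = (
        "Classify the following into these 3 categories: GREETINGS, XYZ, OTHERS\n"
        f"{examples}\n\nNot classified:\n" + "\n".join(chat_lines[-15:])
    )
    # The model completes with something like "<XYZ>"; strip down to the label.
    return complete(prompt).strip("<> \n")

def answer(chat_lines, label, complete, load_doc):
    """Build the document-grounded prompt for the detected topic and complete it."""
    doc = load_doc(f"{label}.txt")  # e.g. the 2.5K-token XYZ document
    prompt = (
        f"{doc}\n\n"
        "Based on the document above, answer the following question:\n\n"
        "A conversation between a client and Jane.\n\n"
        + "\n".join(chat_lines[-5:]) + "\nJane:"
    )
    return complete(prompt)
```

Seen this way, the token cost is easy to locate: every turn pays for the full classification examples plus the whole topic document, which is exactly what retrieval-by-embeddings is meant to shrink.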

And yes, I’ve watched several of David’s videos; they’re awesome and inspired this architecture.

How the files really look

Of course my files look like:

{{ doc }}
Based on the document above, answer following question:

{{ header }}

{{ chat }}

But to simplify things, I’ve commented everything in the post. Also, the prompts are not exactly in the same words as the ones I use; I won’t open all of them now just for the sake of demonstration, ahaha. What remains the same, however, is the architecture and the structure in which I built it.

Hi, I am not very familiar with the chat bot use case, but I think you’d want to (1) in advance, divide your documents into smaller pieces and obtain their embeddings, (2) if the last chat message is about XYZ, then get the embedding for that message (the query), (3) compare the query embedding to the document embeddings using cosine similarity to find the top n semantically similar matches, (4) use the top n matches, instead of the whole document, as part of your next prompt in the chat bot.
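Step (1), splitting the documents into smaller pieces, can be as simple as a sliding window over the text. This is only an illustrative sketch: the function name is mine, and the sizes are in words for simplicity, whereas real pipelines usually count tokens. Overlap between adjacent chunks helps a sentence that straddles a boundary still land whole in at least one chunk.

```python
def chunk_document(text, chunk_size=200, overlap=20):
    """Split a document into overlapping word-window chunks for embedding.

    chunk_size and overlap are word counts; a production pipeline would
    use the tokenizer's token counts instead, but the idea is the same.
    """
    assert 0 <= overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the document
    return chunks
```

Each chunk would then be embedded once, up front, and only the few most similar chunks (rather than the whole 2.5K-token document) would be pasted into the chat prompt.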


Thanks for taking your time to help me!

Can you guide me in the right direction with this? Is there any video, resource, or example I could look at?

Should I give up trying to train davinci on new factual information or specific domain knowledge, and instead use the approach mentioned in this thread of bringing the facts in manually as part of the prompt, in a summary-overview style? Doesn’t that mean that fine-tuning doesn’t actually work for extending domain-specific information?

These resources should help:


Yes I think so. Hopefully others will weigh in.

Based on my experiments, I agree with @lmccallum. I found fine-tuning to work well in encoding/internalizing few-shot learning into a fine-tuned model. I haven’t tried it yet, but I suspect it would also do well in lending a “voice” to the model.

Because of some posts showing impressive results from fine-tuning with a relatively small dataset, I figured it was worth a shot to try fine-tuning to encode/internalize a context. But, to be honest, I wasn’t really expecting it to work and, to be fair, I think it makes sense that it doesn’t: fine-tuning probably works by tweaking the outer layers of the model, but to encode/internalize context the way the stock model does, it stands to reason that you’d need a very large dataset and to train the whole model, not just fine-tune the outer layers.

My mini-dilemma now is figuring out whether it’s worth optimizing the dynamic prompt generation process if GPT-4 ends up being released soon, rendering all of that obsolete 🙂