Poor fine-tuning results with GPT-3.5

I’m training a chatbot to be used by a niche community of users. My issues are:

  1. I am not able to get the training loss to drop and consistently stay low.

  2. The validation and training loss oscillate so much that I can't tell whether the model is overfitting or underfitting.

My training set is 275 rows of back-and-forth conversational messages (5 or more messages in each conversation), and my validation set is 30 rows of conversational messages. Both come from the same set of synthetic conversations that I created, and they seem high quality.
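For reference, each row is one full conversation in the standard chat fine-tuning format for gpt-3.5-turbo. Here's a minimal sketch of how I build a row (the file name and message contents below are placeholders, not my real data):

```python
# Sketch of one training row in the chat fine-tuning format.
# The contents are placeholders, not my actual conversations.
import json

example_row = {
    "messages": [
        {"role": "system", "content": "You are a friendly expert in <niche topic>."},
        {"role": "user", "content": "hey, quick question about ..."},
        {"role": "assistant", "content": "sure, happy to help! ..."},
        {"role": "user", "content": "ok, so what about ..."},
        {"role": "assistant", "content": "good question - ..."},
    ]
}

# Each conversation goes on its own line of the JSONL file.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example_row) + "\n")
```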

What am I doing wrong here? Are 275 training rows plus 30 validation rows too few for proper fine-tuning?

Should I train for more epochs? I usually train for 3-5 epochs, but I've seen the validation loss shoot up if I train longer than that (which I assume means I'm overfitting). This is roughly how I launch the jobs, as shown below.
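A sketch of my job setup, assuming the files are already uploaded with purpose "fine-tune" (the file IDs here are placeholders):

```python
# Roughly how I launch a fine-tuning job; n_epochs is what I vary between runs.
# File IDs are placeholders for my uploaded train/validation JSONL files.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file="file-TRAIN_PLACEHOLDER",    # uploaded train.jsonl
    validation_file="file-VALID_PLACEHOLDER",  # uploaded valid.jsonl
    hyperparameters={"n_epochs": 3},           # I try values between 3 and 5
)
print(job.id, job.status)
```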

Should I invest the money to generate around 5k conversations? I don't want to spend it only to find out that it makes no difference whether the set has 5k or even 30k conversations, because something else is fundamentally wrong with the way I'm fine-tuning. So I'm asking for some insights here.

Here’s the training graph from my last fine-tuning run:

[training/validation loss graph not reproduced here]

Hi there - let me ask you a simple question: how does your fine-tuned model perform when you test it in practice?

Numbers don’t always tell the whole truth, especially with smaller datasets.

Also, what are you fine-tuning for?

Good question. I’m fine-tuning GPT-3.5 to be more conversational, so it sounds like you’re talking to a friend rather than a chat assistant, and I’m also trying to feed it some domain knowledge through these chat conversations.

The results are just OK. Compared to vanilla GPT-3.5 it’s better, but it still doesn’t answer a lot of the questions I’d like it to, in the way I want it to. That’s what I was trying to achieve with these synthetic conversations.

So I wonder whether I should just increase the number of synthetic conversations to 5k or more, hoping that a larger volume of fine-tuning tokens will fix both the domain-knowledge and the conversational-tone issues.

Or whether there’s something I’m missing here, and no amount of synthetic data is going to fix the problem.

OK. Fine-tuning is more appropriate for the first case, i.e. changing response style and behaviour. It is usually not well suited to feeding in domain knowledge; for that you would want to consider embeddings-based retrieval. So what you may need is a hybrid solution that combines the two: retrieve the relevant domain knowledge with embeddings, and let your fine-tuned model answer in the tone you trained. See the sketch below.
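A minimal sketch of what that hybrid could look like, assuming a small in-memory knowledge base and a fine-tuned model ID (all names, snippets, and IDs below are placeholders):

```python
# Hybrid approach: embeddings retrieval for domain knowledge,
# fine-tuned model for conversational tone. Names/IDs are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

# 1) Embed the domain-knowledge snippets once, up front.
docs = [
    "Snippet of domain knowledge #1 ...",
    "Snippet of domain knowledge #2 ...",
]
doc_vecs = np.array([
    d.embedding
    for d in client.embeddings.create(
        model="text-embedding-3-small", input=docs
    ).data
])

def answer(question: str) -> str:
    # 2) Embed the question and retrieve the most similar snippet
    #    by cosine similarity.
    q = np.array(
        client.embeddings.create(
            model="text-embedding-3-small", input=[question]
        ).data[0].embedding
    )
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]

    # 3) Let the fine-tuned model answer in its trained tone,
    #    grounded in the retrieved context.
    resp = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:your-org::PLACEHOLDER",  # your fine-tuned model ID
        messages=[
            {"role": "system", "content": f"Use this context when relevant:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("A typical question from your niche community ..."))
```

This way the fine-tune only has to learn the friendly tone (which small datasets can do), while the knowledge lives in the retrieval layer, where you can update it without retraining.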