Avoid overfitting during the fine-tuning of gpt-3.5-turbo

I have fine-tuned a gpt-3.5-turbo model with a small dataset containing examples of a conversation. I wanted the new model to chat in the style of the examples I provided, but also to be able to draw on general GPT knowledge, to naturally handle potential conversation segues.
However, the fine-tuned model came out overfitted and rigid, and “forgot” all the natural ways of handling a conversation.
Is there a way to avoid this kind of overfitting, for example by only fine-tuning the last layers of the model? Or am I missing the right approach to this kind of problem?

A good fine-tune simply cannot be done with “a small dataset”. So it depends on what you actually mean by that, measured against the thousands or hundreds of thousands of examples used to align base models.

OpenAI has set up adaptive learning rates depending on the training file size. Otherwise your 100 examples would just be a drop added to their 10 million. The one hyperparameter you have control over is n_epochs.
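As a concrete sketch of where that one knob goes (the request body you POST to the fine-tuning jobs endpoint; the file ID below is a placeholder, not a real upload):

```python
import json

# Sketch of the JSON body for POST https://api.openai.com/v1/fine_tuning/jobs.
# "file-abc123" is a placeholder for your uploaded training file's ID.
job_request = {
    "training_file": "file-abc123",
    "model": "gpt-3.5-turbo",
    # Override the auto-selected epoch count; everything else
    # (learning rate, batch size) is chosen for you.
    "hyperparameters": {"n_epochs": 1},
}
print(json.dumps(job_request, indent=2))
```

Leaving out `hyperparameters` entirely lets the endpoint pick the epoch count for you, which is what produced the surprise here.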

You can review the job object report to see how many epochs were used if you didn’t specify, and run again from the start with a third of that. Then a “continue fine-tune”, specifying that existing model, lets you add just one more epoch at a time.
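Continuing a fine-tune is just another job whose `model` field points at the previously produced fine-tuned model. A sketch, with placeholder file and model IDs:

```python
# Sketch: add one more epoch on top of an existing fine-tune by passing
# the earlier job's output model as the base model of a new job.
# Both IDs below are placeholders.
continue_request = {
    "training_file": "file-abc123",
    "model": "ft:gpt-3.5-turbo:my-org:custom:abc123",  # model from the first job
    "hyperparameters": {"n_epochs": 1},  # one additional epoch at a time
}
```

Stepping one epoch at a time like this lets you test the model between passes instead of discovering overfitting only at the end.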

A validation file, made of some held-out questions from your training set, can also be supplied to measure how well the model generalizes. You could instead fill it with out-of-domain questions, if that is what you would rather see graphed once your training is done.
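Making that held-out split is plain bookkeeping; a sketch in plain Python (the example data is made up, and you would write each list out as JSONL before uploading):

```python
import random

def split_train_validation(examples, holdout_fraction=0.1, seed=0):
    """Hold out a fraction of chat examples for the validation file."""
    rng = random.Random(seed)
    shuffled = examples[:]          # don't mutate the caller's list
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Made-up stand-ins for real chat-format training examples.
examples = [
    {"messages": [{"role": "user", "content": f"q{i}"},
                  {"role": "assistant", "content": f"a{i}"}]}
    for i in range(100)
]
train, val = split_train_validation(examples)
print(len(train), len(val))  # 90 10
```

The validation set only measures loss on what you put in it, which is why an out-of-domain validation file tells you something different from a held-out one.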

Thank you for your fast answer @_j .
I agree a “small dataset” is not enough for a good fine-tune, but it should be enough to give me a glimpse of what is going on under the hood, since so little control is provided by the current API.
You mentioned “Otherwise your 100 would just be added to their 10 million”. However, I think the exact opposite happened for me, and the fine-tuning used my hundreds of examples to fine-tune the whole model (not just the last layers, for example).
Let me be more concrete. For the training/validation process I provided ~100 different examples of a same-topic conversation between two actors (their character definitions are in the system prompt). After the fine-tuning (7 epochs) I tested the model. If I follow the script (screenplay), the model’s behaviour is decent. But if, in the middle of the conversation, the “user” actor makes an abstract segue and asks “What is the meaning of life?” (no similar sentence was defined in the training dataset), the other actor’s answer is generated to follow the expected conversation route at that exact place, totally ignoring the abstract question (even though there was a line in the system prompt saying not to blindly follow the script, but to talk naturally).
Does that mean I have to include as many general and abstract questions as possible in my training/validation data for the model’s abilities to generalize? How does my fine-tuning override all of the general knowledge of the basic, pre-trained model? And why is the system prompt itself not helping with this?

I emphasize that with a small number of examples, OpenAI automatically turns up the learning rate and update weights. Instead of a paint brush, you get a sledgehammer to use on the canvas.

The number of epochs needed to make a big difference and blow past the validation-loss optimum is now far lower than it was on GPT-3 (where you got to choose all the hyperparameters, with sensible defaults otherwise).

You can pull out or make another 10% of questions of the same quality to get a validation loss curve as well. However, as you report, it is the satisfaction received from the outputs that is the true measure. A model that can only produce JSON is a far better outcome than a model whose characters can only recite their canned answers.

You can also reconsider the prompting. The system prompt is what should provide an identity to your chatbot, making it easy for the model to break away from acting like ChatGPT and adopt its new persona. However, the whole idea of fine-tuning is that the AI doesn’t need to be prompted or steered by multi-shot examples.

You are kind of doubling up on the reinforcement if you train on, and then also use in your app, a long list of system prompt instructions that are exactly what you taught. So to start, before spending more, you can roll back on those, or even tell it “you are ChatGPT”, to see if it generalizes better.
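One cheap experiment along those lines: query the fine-tuned model with a pared-down system prompt instead of the full instruction list it was trained on. A sketch of the chat-completions request body (the ft: model name is a placeholder):

```python
# Sketch of the JSON body for POST https://api.openai.com/v1/chat/completions,
# testing whether the fine-tuned model generalizes with a generic identity.
# The ft: model name is a placeholder for your job's output model.
chat_request = {
    "model": "ft:gpt-3.5-turbo:my-org:custom:abc123",
    "messages": [
        # Generic identity only: no restatement of the trained behavior.
        {"role": "system", "content": "You are ChatGPT."},
        {"role": "user", "content": "What is the meaning of life?"},
    ],
}
```

If the model still handles the off-script question badly with this minimal prompt, the overfitting is in the weights; if it recovers, the doubled-up system prompt was part of the problem.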