Help with fine-tuning, think I'm over-fitting, but not sure

I’m fine-tuning a gpt-3.5-turbo model and need some help understanding the metrics being spat out.

I’m getting an output curve that looks like this:

I THINK this means I’m… massively… overfitting my fine-tuned models, but am not sure, so that’s my first question… is that what I’m doing?

Second question, assuming the answer to #1 is “yes”: what should I be changing here? I’m running 5 epochs because the outputs I want are very deterministic, just a true/false boolean. Is that too many epochs?

I have a pretty large training set, ~900 examples, with a validation set of ~100 - should I be reducing the number of samples in my training and validation sets?

The data says I’m running “1500 steps”, and from the looks of the chart, I probably want to be stopping at around 400, as that’s where the training and validation loss numbers converge close to zero. But I have no idea what actually controls the number of steps the model runs, because it doesn’t seem to be a multiple of either my example count OR the epoch count…
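For reference, one reading that fits the numbers above, assuming a "step" is one batch update and that the fine-tuning job chose a batch size on its own (the batch size is not stated anywhere in this thread, so that part is an assumption), is steps = batches per epoch × epochs. The sketch below only does that arithmetic; the function names are made up for illustration.

```python
import math


def total_steps(n_examples: int, n_epochs: int, batch_size: int) -> int:
    """Steps = batches per epoch * epochs (a final partial batch still counts)."""
    return math.ceil(n_examples / batch_size) * n_epochs


def implied_batch_size(n_examples: int, n_epochs: int, steps: int) -> float:
    """Work backwards from a reported step count to the batch size it implies."""
    return n_examples * n_epochs / steps


# Numbers from the post; "~900" is approximate, so treat the result as a rough guide.
n_examples, n_epochs, reported_steps = 900, 5, 1500

batch_size = implied_batch_size(n_examples, n_epochs, reported_steps)
print(batch_size)  # 3.0

# Sanity check: a batch size of 3 reproduces the reported step count.
print(total_steps(n_examples, n_epochs, batch_size=3))  # 1500

# Where step 400 would fall, measured in epochs (ASSUMED batch size of 3).
steps_per_epoch = math.ceil(n_examples / 3)
print(round(400 / steps_per_epoch, 2))  # 1.33
```

Under that assumption, step 400 is only a bit more than one pass over the training set, so the loss would be flattening early in epoch 2 rather than late in the run.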

It’s tough to say what the right course of action is without knowing a little bit about your data.

Part of the issue might be that the training data you are using is too homogeneous, but it’s tough to tell without some understanding of what that data looks like.

Edit: I think I glossed over the part where you described the desired output as simply true/false.

This is almost certainly contributing to your perceived over-fitting issue.

Basically, you have 900 inputs you are performing a binary classification on. If there is a very clean break in these example inputs, then this might not exactly be over-fitting.

Here’s a toy example:

Say I wanted to make a classifier which could determine if the first letter of an input is a vowel. There’s no “messiness” there. We would expect that with enough training examples the training loss would be 0 and the validation loss would be 0. That’s not over-fitting—it’s just “fitting”—because it is possible to make a perfect classifier.
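To make the toy example concrete, here is a rough sketch in plain Python. It is not how a GPT fine-tune works internally; it swaps in a tiny logistic model with one weight per first letter, trained on random 5-letter strings. The data generator, learning rate, and set sizes (900 train, 100 validation) are all assumptions chosen for illustration.

```python
import math
import random
import string

VOWELS = set("aeiou")


def make_data(n, rng):
    """Random 5-letter strings, labelled 1.0 if the first letter is a vowel."""
    data = []
    for _ in range(n):
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
        data.append((word, 1.0 if word[0] in VOWELS else 0.0))
    return data


def predict(weights, word):
    """Logistic model with one weight per possible first letter."""
    return 1.0 / (1.0 + math.exp(-weights[ord(word[0]) - ord("a")]))


def mean_loss(weights, data):
    """Average binary cross-entropy, clamped to avoid log(0)."""
    total = 0.0
    for word, label in data:
        p = min(max(predict(weights, word), 1e-12), 1.0 - 1e-12)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(data)


def train(train_set, val_set, epochs=5, lr=1.0):
    """Plain SGD; returns (train_loss, val_loss) after each epoch."""
    weights = [0.0] * 26
    history = []
    for _ in range(epochs):
        for word, label in train_set:
            idx = ord(word[0]) - ord("a")
            # Gradient of cross-entropy with respect to the logit is (p - label).
            weights[idx] -= lr * (predict(weights, word) - label)
        history.append((mean_loss(weights, train_set), mean_loss(weights, val_set)))
    return history


rng = random.Random(0)
history = train(make_data(900, rng), make_data(100, rng))
for epoch, (train_loss, val_loss) in enumerate(history, start=1):
    print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")
```

Both losses should keep shrinking toward zero together, because the rule is perfectly learnable and the validation strings follow the same rule as the training strings. That is the "fitting, not over-fitting" pattern: over-fitting would show validation loss turning back up while training loss keeps falling.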

So, in your case it’s still tough to say without knowing what you’re classifying.

My guess is that it’s not really an issue of too many examples or too many epochs, but rather you don’t have enough really hard examples with nearly identical inputs and different classes.
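One way to check that guess is to count how many such hard pairs the training set actually contains. Below is a minimal sketch using Python's standard `difflib`; the example sentences, the `contrastive_pairs` name, and the 0.85 similarity threshold are all hypothetical, and character-level similarity is only a crude stand-in for "nearly identical".

```python
from difflib import SequenceMatcher
from itertools import combinations


def contrastive_pairs(examples, threshold=0.85):
    """Return (text_a, text_b, similarity) for near-identical inputs whose labels differ."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        if label_a == label_b:
            continue  # only pairs that cross the class boundary are interesting
        similarity = SequenceMatcher(None, text_a, text_b).ratio()
        if similarity >= threshold:
            pairs.append((text_a, text_b, similarity))
    return pairs


# Hypothetical training examples: (input text, boolean label).
examples = [
    ("The invoice was paid on time.", True),
    ("The invoice was not paid on time.", False),
    ("Shipment arrived damaged.", False),
    ("Great weather today.", True),
]

hard = contrastive_pairs(examples)
print(len(hard), "contrastive pair(s) found")
for text_a, text_b, similarity in hard:
    print(f"{similarity:.2f}: {text_a!r} vs {text_b!r}")
```

This compares every pair, so it is quadratic in the number of examples; for around 900 examples that is still only a few hundred thousand comparisons. If it finds very few pairs, the classes are probably cleanly separated in the training data.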

Another regular forum user found the exact same thing. Tagging @curt.kennedy for visibility.

Yeah @edmund, I saw the same weird TL curve when fine-tuning a binary classifier on this thread over here:

My training file had 4000 examples, and the system decided to choose 3 epochs for this amount of data. So with only 3 epochs, I don’t feel I was overfitting, and all examples were totally different tokens going in (no repeats).

When I get some time, I’m going to monitor this model alongside the old Babbage model and see if there are any discrepancies or degradation in model performance (the old model was trained for 4 epochs on the same training data).

But initial spot-checks show the new “overfit” model is performing correctly. Just need more data to be confident.

But the TL curve going to 0 is disturbing!

I tried pasting all the text and graphs you posted into ChatGPT (GPT-3.5-Turbo) for help. I read the first answer it generated plus the second and third regenerations, and found them very useful (I can’t post them here because they are very long). Those analyses may help you decide. Have a nice day.

What does ChatGPT think about training loss being 0.0000? @Gadcuit

I can’t imagine it would think it’s a good thing.

… When I saw training loss = 0.000, I didn’t dare think about it any further. I didn’t want to think that the validation loss would be equal to 0.000 either.