Training loss=good, Validation loss=good

I recently fine-tuned a model using the default 3 epochs.

The training loss was: 0.29
The validation loss was: 0.4

As far as I’m aware, this is quite good.

However, the results I’m getting from the model are just not very good. At all. It is as if I never finetuned it at all.

Should I try again with perhaps more epochs?

Bear in mind, it’s not an enormous data set I’m using: 243 examples in the training dataset, and 64 in the validation dataset. However, I could increase the size of that if you guys think that is the problem?

Any guidance would be thoroughly appreciated👍

The most important thing is the data you’re fine-tuning it on.

Are you willing to share some more details about your use-case and possibly a few examples from your dataset?

We’d be able to provide much better suggestions with that extra context.

Sure, I’m finetuning an AI tutor which can perform a range of different function calls (i.e. the make lesson content, lesson plans, course structures, mark questions) etc.

I have a bunch of function call examples for each of them, with a system prompt (always the same), user prompt, and the function_call name and arguments, along with the tools property indicating the functions that the model can perform at that time.


“messages”: [
“role”: “user”,
“content”: “Please make the units for a module of my course on Cell Biology. The module topic is Applications Of Cell Biology. The module entry knowledge is: Knowledge from previous modules., and the module end knowledge is: Exploring real-world applications of cell biology in areas such as medicine, biotechnology, and research…”
“function_call”: {
“name”: “makeUnits”,
“arguments”: “{[array of 5 units, each with two arrays containing 2-3 items each]}”
“role”: “assistant”

If you want any more detail I could give you an email?


I just fine tuned using 550,000 tokens and its like I broke the model. Literally- it can’t answer any questions correctly- and doesn’t even know its persona even though all chat completions and data sets start with:
{“messages”: [{“role”: “system”, “content”: "Your persona is ‘Val’. Val serves as the AI representation of HomeRank, a cutting-edge Home Valuation system…

Finetuning can be weird. I’m sure the system should be able to cope with 500k tokens given that on the pricing page it expresses prices as per 1 Million tokens, suggesting the problem lies elsewhere. Exactly where…I don’t know. I’ve had pretty mixed results with finetuning. Sometimes the models just do weird things :rofl: I’ve just got access to finetuning for gpt-4, and gpt 3.5-0125, which may yield better results.


Is it possible that you guys are expecting the wrong thing from fine-tunes?

did you read this?

It’s technically a way to optimize multi-shot prompts. If you can’t achieve the same results with multi-shot prompting, it’s unlikely that you’ll get good results with fine-tuning :confused:

1 Like

The one thing i’ve dealt with since the latest upgrades to 0125 is over-fitting, previously this wasn’t an issue on my data-sets but now i’ve been adjusting the hyper-parameters when training and it helps a lot.

I’ve kind of experienced that as well.

Here is the plotted training and validation loss for a recent job I did using 0125:

I’m no expert by any means, but here’s my interpretation of what happened (please correct me if I’m talking nonsense):

The training loss steadily decreased, and so did the validation loss. This is good. But then towards the end, the validation loss started to increase. This means the model stopped learning the general problem, and started overfitting.

Maybe the 0125 model learns quicker than previous models, meaning it needs less epochs?? I really don’t know.

Well, “good” in train and val loss is not absolute.
Good val and train loss depends on the deviation from the initial loss as well as the factor influencing the decrease.
so just because its 0. something doesn’t mean its good. Your 0.29 could actually be 0.299999; which is a large number compared to if you had a loss like 0.00009 right?.
You need to iteratively train the model until you start to see promising results. so more training please!
There are other factors that could also be considered but this could be a good starting point