I’m fine-tuning a model and I don’t know exactly how to interpret the results.
If an early point on the loss curve has a much lower validation loss than a later point, does that mean the earlier checkpoint is actually better?
Or was it just evaluated on an easy subset of the validation set?
(example: compare the highlighted timestep vs. the final timestep)
Maybe the validation loss is calculated locally, on just one batch at a time …
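To illustrate why a per-batch validation loss can look much lower than the true full-set loss, here is a minimal sketch. The numbers are hypothetical (I'm assuming a mix of easy and hard validation examples, and a 32-example eval batch); it just shows how much the reported loss can swing depending on which examples a given eval step happens to draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example validation losses for a 4,500-example set:
# mostly easy examples (low loss) plus a minority of hard ones.
per_example_loss = np.concatenate([
    rng.normal(0.5, 0.1, 4000),   # easy examples
    rng.normal(3.0, 0.5, 500),    # hard examples
])

# The "true" validation loss averages over the whole set.
full_val_loss = per_example_loss.mean()

# If each eval step only samples a small batch, the reported loss
# depends heavily on which examples it happened to draw.
subset_losses = [
    rng.choice(per_example_loss, size=32, replace=False).mean()
    for _ in range(10)
]

print(f"full-set val loss:  {full_val_loss:.3f}")
print(f"32-example batches: min={min(subset_losses):.3f}, "
      f"max={max(subset_losses):.3f}")
```

A batch that happens to contain few hard examples will report a loss well below the full-set average, so a single low point on the curve isn’t strong evidence that the checkpoint is better.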
In the past, I have gotten weird loss curves like this while training on 500,000 tokens and 4,500 examples: the curve goes to zero, then has a weird bump at the end.
Overall the model appears to be fine, even with this odd curve. I base this on comparing models trained with the new fine-tuning system against the old system on the same underlying training data.