Hello esteemed community,
I am working on a project for a digital health company, and I need to choose between “clever prompting” and finetuning. While I have been able to reach quite good quality with prompting, there is still room for improvement, so I have been experimenting with finetuning.
I’ve created a dataset of roughly 1M tokens for training and another 1M for validation, drawn from high-quality datasets related to the business.
When I ran with 10 epochs, the validation loss followed a typical “sweet spot” trend: consistent improvement over the first half of training, then degradation thereafter. I figured I had overshot the number of epochs and that ~5 would be optimal, so I ran another training job on the same data, this time requesting only 5 epochs. Strangely, the resulting validation loss curve followed a similar “sweet spot” trend, showing improvement until roughly the halfway point (2.5 epochs) and degradation thereafter.
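To make concrete what I mean by the “sweet spot”, here is roughly the selection logic I had in mind (a minimal sketch; the per-epoch loss values are made up to mimic the shape I saw, they are not the dashboard’s actual numbers):

```python
# Illustrative only: not the platform's code, just the selection logic I mean.
# The per-epoch validation losses below are invented to mimic the
# improve-then-degrade ("sweet spot") shape I observed in the 10-epoch run.
val_losses = [1.32, 1.10, 0.95, 0.88, 0.84, 0.86, 0.91, 0.99, 1.08, 1.17]

# Pick the epoch (1-indexed) with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1
print(f"best epoch: {best_epoch}, val loss: {val_losses[best_epoch - 1]:.2f}")
# prints: best epoch: 5, val loss: 0.84 -> hence my guess that ~5 epochs is optimal
```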
Has anyone experienced a similar phenomenon?
This made me suspect that the learning rate is scaled inversely with the number of requested epochs, but that doesn’t make much sense. I am also confused by the plotting of the training and validation loss on the finetuning dashboard. What is the x axis? When I ran 10 epochs it reached roughly 2000, and when I ran 5 epochs it also reached roughly 2000. Does anyone know what units are measured there?
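For context, here is the back-of-the-envelope arithmetic behind my confusion (the batch size and sequence length are pure guesses on my part, since I don’t know what the service uses internally):

```python
# Rough step-count estimate, assuming the x axis is optimizer steps.
# Batch size and sequence length are assumptions, not the service's
# real hyperparameters.
train_tokens = 1_000_000
seq_len = 2048        # assumed tokens per training example
batch_size = 2        # assumed examples per optimizer step

tokens_per_step = seq_len * batch_size
for epochs in (5, 10):
    steps = epochs * train_tokens // tokens_per_step
    print(f"{epochs} epochs -> ~{steps} steps")
# Whatever the exact numbers, the 10-epoch run should cover roughly twice
# as many steps as the 5-epoch run, yet both plots topped out around 2000.
```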
Finally, can I assume that the output model corresponds to the checkpoint with the lowest validation loss, or is it simply the one from the last epoch?
Any ideas?
Thanks
Nir