What do these three metrics specifically represent?
I understand that a lower training loss indicates a better fit, and that a validation loss greater than the training loss suggests overfitting.
But are there any standards or thresholds for these metrics? I couldn't find anything about this in the Guide, so I'd like to ask whether anyone knows how these three metrics should be interpreted.
In machine learning, the goal is to build a parsimonious model while minimizing overfitting. For any training run, a portion of the data is held out from training and used for validation instead, by evaluating the trained model on it. That held-out portion is the validation set; the training set is the data the model was actually trained on. The training loss is then the value of the loss function in sample, and the validation loss is the value of the loss function out of sample, each reported per step/epoch. The full validation loss is the same metric computed over the entire validation set (typically at the end of an epoch) rather than on a sampled batch. I hope this clarifies your question.
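To make the in-sample vs. out-of-sample distinction concrete, here is a minimal sketch of a per-token cross-entropy (negative log-likelihood) loss, which is the usual loss for language-model fine-tuning. The function name, the example log-probabilities, and the exact reduction (mean over tokens) are my own assumptions for illustration; OpenAI has not published the precise formula behind the numbers it reports.

```python
def mean_token_nll(token_log_probs):
    """Mean negative log-likelihood per target token (standard cross-entropy).

    `token_log_probs` holds the log-probabilities the model assigned to each
    target (completion) token in a batch. Treat this as an illustration, not
    as OpenAI's exact calculation.
    """
    return -sum(token_log_probs) / len(token_log_probs)

# Training loss: computed on a batch drawn from the training set.
train_batch_log_probs = [-0.9, -1.2, -0.4, -2.1]   # hypothetical values
training_loss = mean_token_nll(train_batch_log_probs)

# Validation loss: the same metric on held-out examples that were
# never used to update the model's weights.
valid_batch_log_probs = [-1.1, -1.5, -0.7, -2.4]   # hypothetical values
validation_loss = mean_token_nll(valid_batch_log_probs)

print(f"training loss:   {training_loss:.4f}")
print(f"validation loss: {validation_loss:.4f}")
```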
Thank you for the reply. However, my question mainly focuses on:
Are there specific data ranges for training loss and validation loss, such as [0, 10.00000]?
Are there standard ranges for these two metrics that indicate whether a model is usable? For example, is a model considered usable if the training loss is below 1.0000 and the validation loss is below the training loss? I'm looking for that kind of interpretive guidance.
I was looking for similar benchmarks too, but I think the value depends on the specific validation set. Hopefully someone can clarify whether it has been normalized; it looks like it could be, since it often starts around 1.
Anyway, until someone clarifies, I think the best approach is to check whether the loss is decreasing and converges during fine-tuning (roughly like the sketch below). If it stays close to its initial value, then fine-tuning did not "do anything useful" and we are better off using the base model. On the other hand, if it has not converged yet, we might benefit from fine-tuning longer.
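For example, a quick heuristic for reading the logged loss values might look like this. The function name, window size, and tolerance are arbitrary choices of mine, not anything from OpenAI's guide:

```python
def loss_trend(losses, tail=5, flat_tol=0.05):
    """Rough heuristic for reading a fine-tuning loss curve.

    `losses` is the sequence of (validation) losses logged per step or epoch.
    The tail size and tolerance are arbitrary values for this sketch.
    """
    if len(losses) < 2 * tail:
        return "not enough points to judge"

    start = sum(losses[:tail]) / tail            # average of the first few points
    recent = sum(losses[-tail:]) / tail          # average of the last few points
    prev = sum(losses[-2 * tail:-tail]) / tail   # average of the window before that

    if abs(start - recent) < flat_tol:
        return "loss barely moved: fine-tuning may not have done anything useful"
    if abs(prev - recent) < flat_tol:
        return "loss decreased and has flattened out: looks converged"
    return "loss is still decreasing: training longer might help"

# Hypothetical logged validation losses from a fine-tuning run.
print(loss_trend([1.80, 1.50, 1.30, 1.10, 1.00,
                  0.90, 0.86, 0.84, 0.83, 0.83,
                  0.82, 0.82, 0.81, 0.81, 0.81]))
```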
Is the model usable? If it is better than the base model, and the base model was already somehow usable, then it probably is. How much better does it need to be to be worth it? I think this depends on the use case, and we can only test in real life (or use other evaluation tools).
This is just my speculation, without knowing exactly how OpenAI calculates the loss. Maybe someone with more knowledge will help us understand better soon.