Hello everyone,
I have been fine-tuning GPT-3.5-Turbo for a while, but I cannot find any documentation on how the loss is computed on the training and validation sets.
With RMSE in regression problems I know that closer to 0 is better, and I assume the same holds here. But what is the formula? How is it computed between sentences?
Many thanks in advance.
The legacy fine-tuning documentation says this about validation:
If you provided a validation file, we periodically calculate metrics on batches of validation data during training time. You will see the following additional metrics in your results file:
- validation_loss: loss on the validation batch
- validation_sequence_accuracy: the percentage of completions in the validation batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67
- validation_token_accuracy: the percentage of tokens in the validation batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83
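If it helps to see those two metrics in code, here is a minimal sketch (my own illustration, not OpenAI's actual implementation) that reproduces the 2/3 and 5/6 numbers from the example above:

```python
# Illustrative re-implementation of the two validation accuracy metrics.
# Not OpenAI's code; just what the documented definitions amount to.

def sequence_accuracy(true_completions, predicted_completions):
    """Fraction of completions whose predicted tokens match exactly."""
    exact = sum(1 for t, p in zip(true_completions, predicted_completions) if t == p)
    return exact / len(true_completions)

def token_accuracy(true_completions, predicted_completions):
    """Fraction of individual tokens predicted correctly."""
    correct = total = 0
    for t, p in zip(true_completions, predicted_completions):
        for t_tok, p_tok in zip(t, p):
            correct += t_tok == p_tok
            total += 1
    return correct / total

true = [[1, 2], [0, 5], [4, 2]]
pred = [[1, 1], [0, 5], [4, 2]]
print(sequence_accuracy(true, pred))  # 2/3 ≈ 0.67
print(token_accuracy(true, pred))     # 5/6 ≈ 0.83
```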
For the loss function, the number decreases each batch. If you want to dig deeper, it appears to be the same token-level cross-entropy that Hugging Face Transformers uses for its PyTorch language models (GPT-2 and BERT alike); see for example: https://github.com/huggingface/transformers/blob/v4.33.3/src/transformers/models/bert/modeling_bert.py#L771
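As a concrete illustration of that loss, here is the standard next-token cross-entropy as those Hugging Face model heads compute it; whether OpenAI's fine-tuning pipeline does exactly this is an assumption on my part:

```python
import torch
import torch.nn.functional as F

# Causal-LM cross-entropy as in the Hugging Face GPT-2/BERT LM heads:
# shift logits and labels by one so each position predicts the NEXT token,
# then average the per-token negative log-likelihood.
def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # masked/prompt tokens are excluded from the loss
    )

# Toy usage: random logits over a 10-token vocabulary.
logits = torch.randn(2, 5, 10)
labels = torch.randint(0, 10, (2, 5))
print(causal_lm_loss(logits, labels))  # lower is better; 0 would be a perfect fit
```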
How much of this still applies to the new fine-tuning endpoint is speculative, given how many hyperparameters and how much of the Weights & Biases-compatible results output have been taken away.
We can only guess. There is simply no "here's how not to waste your money" guide.
Here, my working assumption is that the best performance comes at the lowest validation loss, around the point where training loss has first converged.
After that point, the fine-tune becomes over-specialized on the training inputs and stops generalizing to the alternate cases in your held-out validation set.
So I would guess the best general performance on the TYPE of questions you trained on comes at roughly half the epochs you ran (keeping in mind that other hyperparameters are also adjusted behind the scenes if the training file size changes).
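For example, if you pull the per-step metrics into a CSV, finding that "best point" could look like this rough sketch (the column names "step" and "validation_loss" are my assumption about the metrics file layout, not a documented schema):

```python
import csv

# Rough sketch: pick the step with the lowest validation loss from a
# per-step metrics CSV. Column names are assumed, not guaranteed.
def best_step(results_path: str):
    best = None
    with open(results_path, newline="") as f:
        for row in csv.DictReader(f):
            if not row.get("validation_loss"):
                continue  # validation metrics are only logged periodically
            step, loss = int(row["step"]), float(row["validation_loss"])
            if best is None or loss < best[1]:
                best = (step, loss)
    return best

print(best_step("results.csv"))  # e.g. (420, 1.23): the checkpoint to prefer
```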
After that analysis, a better use of the investment you made in preparing your 20% held-out set would be to fold it back into a final, unvalidated fine-tune, giving the most varied training data. Then you can test performance on unanticipated human inputs.