I know there can’t be a one-size-fits-all answer to this, but I am looking for the general principle.
Here’s some information about my project:
I am creating a fine-tuned model for a specific scenario. The “system” message is the same across all the training examples, and for every “assistant” message I have 3 different ways the “user” can ask for it, so this 3x’d my fine-tuning dataset.
As it stands, I have ~2500 examples in my fine-tuning dataset. I plan to use the gpt-3.5-turbo-1106 model initially, but I may try gpt-4-1106-preview later with the same dataset.
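For context, this is roughly how the dataset gets built (a minimal sketch; the system message, phrasings, and replies here are made-up placeholders, not my actual data):

```python
import json

# Same "system" message for every example (placeholder text).
SYSTEM = "You are a helpful assistant for my specific scenario."

# Each target assistant reply is paired with 3 user phrasings,
# so N targets expand into 3*N training examples.
targets = [
    ("What's the refund policy?",
     "How do I get a refund?",
     "Can I return this?",
     "Refunds are available within 30 days of purchase."),
]

examples = []
for *phrasings, assistant_reply in targets:
    for user_msg in phrasings:
        examples.append({
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_reply},
            ]
        })

# Fine-tuning expects JSONL: one chat example per line.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```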
Typically 5%-20% of your dataset is held out for evaluation. That said, there is an argument for keeping only a handful of examples aside for manual human validation and folding the rest of the would-be eval set back into training, since the extra training data can give you a better run…
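As a concrete sketch of that hold-out split (file names are placeholders, and the toy data stands in for your ~2500-example JSONL file):

```python
import json
import random

# Toy stand-in for the dataset; in practice you'd read your existing JSONL file.
examples = [
    {"messages": [{"role": "user", "content": f"example {i}"}]}
    for i in range(2500)
]

random.seed(0)       # reproducible split
random.shuffle(examples)

val_fraction = 0.10  # anywhere in the 5%-20% range works
n_val = int(len(examples) * val_fraction)
val, train = examples[:n_val], examples[n_val:]

# One JSON object per line, as the fine-tuning API expects.
with open("train_split.jsonl", "w") as f:
    f.writelines(json.dumps(ex) + "\n" for ex in train)
with open("val_split.jsonl", "w") as f:
    f.writelines(json.dumps(ex) + "\n" for ex in val)
```

One caveat for your setup: since each assistant reply appears under 3 user phrasings, a naive shuffle like this can leak near-duplicates of a training example into validation. Splitting at the level of targets, before expanding into the 3 phrasings, avoids that.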
Thanks for the response. Is there any documentation on how the validation set is used with the fine-tuning API?
My experience with it is limited to simple classification tasks where there is only one correct answer. In this case, though, the generated outputs might still be good even if they don’t match the data in my validation set exactly.
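One workaround I’m considering for that: score similarity against the reference instead of requiring an exact match. A minimal sketch using the standard-library difflib (the example strings are made up):

```python
from difflib import SequenceMatcher

def similarity(generated: str, reference: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, generated, reference).ratio()

# An output can be "good" without matching the reference verbatim:
ref = "Refunds are available within 30 days of purchase."
gen = "You can get a refund within 30 days of buying."
# exact match fails here, yet the two answers overlap heavily
```

This is crude compared to human review or model-graded evals, but it at least separates “different wording” from “wrong answer”.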
I would start with the OpenAI guide here, and then perhaps look at YouTube for guides on fine-tuning and evaluation.