I know there can’t be a one-size-fits-all answer to this, but I am looking for the general principle.
Here’s some information about my project:
I am creating a fine-tuned model for a specific scenario. The “system” message is the same across all the training examples, and for every “assistant” message I have 3 different ways the “user” can ask for it, so this 3x’d my fine-tuning dataset.
As it stands, I have ~2500 examples in my fine-tuning dataset. I plan to use the gpt-3.5-turbo-1106 model initially, but I may try gpt-4-1106-preview later with the same dataset.
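For context, this is roughly how the dataset gets built (a minimal sketch; the system message, phrasings, and replies here are made-up placeholders, not my actual data):

```python
import json

# Same "system" message for every example (placeholder text).
SYSTEM = "You are a helpful assistant for my specific scenario."

# Each target assistant reply is paired with 3 user phrasings,
# so N targets expand into 3*N training examples.
targets = [
    ("What's the refund policy?",
     "How do I get a refund?",
     "Can I return this?",
     "Refunds are available within 30 days of purchase."),
]

examples = []
for *phrasings, assistant_reply in targets:
    for user_msg in phrasings:
        examples.append({
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_reply},
            ]
        })

# Fine-tuning expects JSONL: one chat example per line.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```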
Typically 5%-20% of your dataset is held out for evaluation. That said, there is an argument for keeping only a handful of examples aside for manual human validation and folding the rest of the would-be eval set back into training, since the extra training data can give you a better run…
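As a concrete sketch of that hold-out split (file names are placeholders, and the toy data stands in for your ~2500-example JSONL file):

```python
import json
import random

# Toy stand-in for the dataset; in practice you'd read your existing JSONL file.
examples = [
    {"messages": [{"role": "user", "content": f"example {i}"}]}
    for i in range(2500)
]

random.seed(0)       # reproducible split
random.shuffle(examples)

val_fraction = 0.10  # anywhere in the 5%-20% range works
n_val = int(len(examples) * val_fraction)
val, train = examples[:n_val], examples[n_val:]

# One JSON object per line, as the fine-tuning API expects.
with open("train_split.jsonl", "w") as f:
    f.writelines(json.dumps(ex) + "\n" for ex in train)
with open("val_split.jsonl", "w") as f:
    f.writelines(json.dumps(ex) + "\n" for ex in val)
```

One caveat for your setup: since each assistant reply appears under 3 user phrasings, a naive shuffle like this can leak near-duplicates of a training example into validation. Splitting at the level of targets, before expanding into the 3 phrasings, avoids that.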
Thanks for the response. Is there any documentation on how the validation set is used with the fine-tuning API?
My experience with it is limited to simple classification tasks where there is only one correct answer. In this case, though, the generated outputs might still be good even if they don’t match the data in my validation set exactly.
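One workaround I’m considering for that: score similarity against the reference instead of requiring an exact match. A minimal sketch using the standard-library difflib (the example strings are made up):

```python
from difflib import SequenceMatcher

def similarity(generated: str, reference: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, generated, reference).ratio()

# An output can be "good" without matching the reference verbatim:
ref = "Refunds are available within 30 days of purchase."
gen = "You can get a refund within 30 days of buying."
# exact match fails here, yet the two answers overlap heavily
```

This is crude compared to human review or model-graded evals, but it at least separates “different wording” from “wrong answer”.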
I would start with the OpenAI guide here, and then perhaps look at YouTube for guides on fine-tuning and evaluation.