What is validation_file for?

Hello there! Could someone please reveal the truth about validation_file in fine-tuning :slight_smile:
Why should I use it? The documentation has very little information about validation_file!

A validation file contains examples of the same type and quality as your training data, but held out from training.

At the end of batches during fine-tuning, the partially-trained model is evaluated against the validation examples, scoring its token outputs against the expected ones to measure the deviation (validation loss).

One can chart that progress to find the point where the model has become optimized for the full breadth of questions, without becoming over-trained on only the specific inputs and outputs in the training set.
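As a sketch of how such a held-out split might be produced before uploading (file names and the 80/20 ratio here are just illustrative assumptions, not anything the API requires):

```python
import json
import random

def split_examples(examples, holdout_fraction=0.2, seed=0):
    """Shuffle and split chat-format examples into training and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

def write_jsonl(path, examples):
    """Write one JSON object per line, the format the fine-tuning endpoint expects."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

# Placeholder examples standing in for real curated data
examples = [
    {"messages": [{"role": "user", "content": f"question {i}"},
                  {"role": "assistant", "content": f"answer {i}"}]}
    for i in range(50)
]

train, valid = split_examples(examples)
write_jsonl("training.jsonl", train)
write_jsonl("validation.jsonl", valid)
print(len(train), len(valid))  # 40 10
```

The two files would then be uploaded separately and passed as `training_file` and `validation_file` when creating the fine-tuning job.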


@_j I’m about to test this myself, but I’m curious if there’s expected to be a difference between, say,
a. providing 50 training samples and no validation samples
b. providing 40 training samples and 10 validation samples.

In my case, I’m hoping that including some of the samples that haven’t been handled well by the model in the validation set will improve those cases.

The validation file can be held-out questions of the same type, as in your second example, but since those examples are only used for computing statistics, holding them out can reduce the coverage and inference ability of the model if they would have been just as valid for training.

You can also use the validation file for applications other than strictly another held-out training set. For instance, you might include questions where the pretrained gpt-3.5-turbo model is already proficient, and watch how the quality degrades as you train on another system prompt and task.

Thank you for your reply.

they are only used for creating statistics

This helped me understand better. Do you know if, in the case where a validation file isn’t provided, OpenAI uses the whole training file for validation, a sample of it, or something else?

There are simply no validation loss statistics or charts if you don’t include a validation file.

Instead, you can test the fine-tune’s performance on new inputs yourself, rather than only getting statistics in exchange for writing examples of training quality that are then never trained on.
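When a validation file is provided, the job's result file (a CSV retrievable through the Files API) includes per-step validation metrics you can chart. A minimal sketch of reading such a file and spotting where overfitting begins; the CSV excerpt and exact column names here are assumptions to verify against your own result file:

```python
import csv
import io

# Hypothetical excerpt of a fine-tuning result file, with per-step
# training and validation loss columns as produced when a
# validation_file was supplied.
RESULTS_CSV = """step,train_loss,valid_loss
1,2.91,2.95
2,2.10,2.30
3,1.40,1.80
4,0.90,1.75
5,0.55,1.95
"""

def best_validation_step(csv_text):
    """Return the row with the lowest validation loss -- a rough signal of
    where further training starts to overfit the training set."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return min(rows, key=lambda r: float(r["valid_loss"]))

best = best_validation_step(RESULTS_CSV)
print(best["step"], best["valid_loss"])  # 4 1.75
```

In this toy data, training loss keeps falling after step 4 while validation loss turns back up, which is the divergence the chart is meant to expose.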

@_j hey, sorry if this is the wrong place to ask, but since this topic was already discussing the validation_file:

Is it possible to add a validation_file to an already existing fine-tuning job? I seem to have forgotten to add it to an earlier model I fine-tuned, and I would rather not be charged to fine-tune it again…

I suppose you could get a single score, but not a learning curve, by continuing the fine-tuned model with your validation file, along with a minimal training file run at 1 epoch with a non-impactful learning rate multiplier. You’d pass more tokens, so you’d be charged again, and you’d have another model name hanging around forever.

Or you could go for one more real n_epochs, so that you get a model with just a bit more reinforcement than before to compare against.

You wouldn’t be able to obtain the original evaluations that happen at the end of each batch, though. It is really the interpretation of the learning curve that is useful, not a single point.
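A sketch of what that continuation request might look like; every identifier below is a placeholder, and the hyperparameter names are assumptions to check against the current fine-tuning API reference. It only builds the payload rather than sending it:

```python
# Build (but do not send) a request body for continuing an existing
# fine-tuned model with a minimal training file and a learning rate
# multiplier chosen to have almost no impact, so the run mostly
# exists to produce validation statistics.
# All file IDs and the model name are placeholders.
continuation_request = {
    "model": "ft:gpt-3.5-turbo:my-org:earlier-model:abc123",  # existing fine-tune
    "training_file": "file-minimal-training",        # a tiny training set
    "validation_file": "file-forgotten-validation",  # the held-out examples
    "hyperparameters": {
        "n_epochs": 1,
        "learning_rate_multiplier": 0.02,  # near-zero learning
    },
}

print(sorted(continuation_request))
```

Passing this to the fine-tuning job creation endpoint would still bill the tokens and mint a new model name, which is the trade-off described above.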

Thanks for the swift reply.

In that case I will probably have to train it again.