Hyperparameter Fine-tuning Guide

Hey, I’m working on fine-tunes and trying to get a better idea of how the hyperparameters affect training outcomes, especially in the context of avoiding overfitting.

I’ve been attempting to avoid overfitting by lowering epochs, learning rate multiplier, and batch size, but just from looking at the validation vs. training loss curves, I can see it is still overfitting.

I realize a lot of this depends on the data and its size. However, on a recent run with an 8-million-token dataset, I lowered my hyperparameters to 3 epochs, a 1.5 learning rate multiplier, and a batch size of 2 (previously defaulted to 3, 2, and 3 respectively). With this, though, the number of steps increased dramatically, and I can see the point at which the validation loss stops tracking the training loss (which keeps converging toward 0) and starts shooting up, roughly 30% of the way through the steps.
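(Side note: here’s a rough sketch of how one could pick out that divergence point from exported per-step metrics. The CSV layout and the `step`/`valid_loss` column names are assumptions, not necessarily what the job actually emits.)

```python
import csv

def divergence_step(rows, patience=5):
    """Return the first recorded step where valid_loss has risen for
    `patience` consecutive records (a simple overfitting heuristic)."""
    rising, prev = 0, None
    for row in rows:
        v = float(row["valid_loss"])
        if prev is not None and v > prev:
            rising += 1
            if rising >= patience:
                return int(row["step"])
        else:
            rising = 0
        prev = v
    return None

# Assumed layout: a CSV with "step" and "valid_loss" columns.
with open("metrics.csv") as f:
    print(divergence_step(list(csv.DictReader(f))))
```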

I say all of this to ask whether there is a guide out there, or whether anyone could give a summary of how these hyperparameters work together and how they affect things like training-loss convergence toward 0, number of steps, depth of training, etc.

Thanks

Batch size directly determines the number of steps: steps per epoch are roughly the number of training examples divided by the batch size, so lowering the batch size raises the step count.

You should tune the other hyperparameters independently of that setting.
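A quick sketch of that arithmetic (the ceil-based step accounting is an assumption; exact bookkeeping can differ per trainer):

```python
import math

def total_steps(num_examples: int, batch_size: int, n_epochs: int) -> int:
    # Assumes steps per epoch = ceil(examples / batch size);
    # exact bookkeeping may differ slightly in a given trainer.
    return math.ceil(num_examples / batch_size) * n_epochs

# Same dataset and epochs, smaller batch size -> more steps:
print(total_steps(num_examples=9000, batch_size=3, n_epochs=3))  # 9000
print(total_steps(num_examples=9000, batch_size=2, n_epochs=3))  # 13500
```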

You also have the ability to continue training on an existing model, so you can compare the quality of one 2-epoch run vs. two consecutive 1-epoch runs, as an example of other training fun.

My spending on fine-tune experiments is done until they let you delete (or at least “archive”) the models.

Got it, so lower batch size = more steps? And yeah, I’m definitely going to go all-in on experimenting with smaller training sets.

Also, is the 2x 1-epoch thing continuing training after the fact with a lowered epoch count, or is that a ratio to batch size? Sorry, you lost me a bit there.

You can now continue a model’s training by specifying the name of the previous fine-tuned model that you want to build on.

Epochs, as a hyperparameter, is simply the number of passes through your entire training file at the given learning rate. So running the same training again on a previous fine-tuned model should be similar to adding the new job’s epoch count to the original run. This lets you deepen the weights gradually, at no more token cost than running those epochs in a single job, ultimately finding the point of overfitting (but with several undeletable intermediate models along the way).
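Roughly, with the OpenAI Python SDK it looks like this (the file ID and model name below are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Continue training by passing a previously fine-tuned model as the
# base model. The file ID and model name below are placeholders.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="ft:gpt-3.5-turbo-0125:my-org::example123",
    hyperparameters={"n_epochs": 1},  # one more pass on top of the previous run
)
print(job.id, job.status)
```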


So essentially the best strategy would be to start with lower hyperparameters (risking under-fit) and then inch forward with continued training up to the point of convergence between validation and training loss.
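As a sketch of that strategy (the file ID and base model are placeholders, and `eval_valid_loss` is a hypothetical helper standing in for however you score a model on your held-out set):

```python
import time

from openai import OpenAI

client = OpenAI()

def wait_for(job_id: str) -> str:
    """Poll until the job finishes; return the fine-tuned model name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"job ended with status {job.status}")
        time.sleep(60)

def eval_valid_loss(model_name: str) -> float:
    # Hypothetical helper: score model_name on your held-out set
    # and return the mean loss. Fill in with your own evaluation.
    raise NotImplementedError

model = "gpt-3.5-turbo-0125"  # starting base model (placeholder)
best = float("inf")
for _ in range(5):  # inch forward one epoch at a time
    job = client.fine_tuning.jobs.create(
        training_file="file-abc123",  # placeholder file ID
        model=model,
        hyperparameters={"n_epochs": 1},
    )
    model = wait_for(job.id)
    loss = eval_valid_loss(model)
    if loss >= best:  # validation loss turned upward: likely overfitting
        break
    best = loss
print("stopping with", model)
```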