Hyperparameter Fine-tuning Guide

Hey, I’m working on fine-tunes and trying to get a better idea of how the hyperparameters affect training outcomes, especially in the context of avoiding overfitting.

I’ve been attempting to avoid overfitting by lowering epochs, learning rate multiplier, and batch size, but just from looking at the validation vs. training loss curves, I can see it is still overfitting.

I realize a lot of this depends on the data and its size. However, on a recent run with an 8-million-token dataset, I lowered my hyperparameters to 3 epochs, a 1.5 learning rate multiplier, and a batch size of 2 (previously defaulted to 3, 2, and 3 respectively). With this, though, the number of steps increased dramatically, and I can see the point at which the validation loss stops tracking the training loss (which keeps converging toward 0) and starts shooting up, roughly 30% of the way through the steps.
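(Side note: here’s a rough sketch of how one could pick out that divergence point from exported per-step metrics. The CSV layout and the `step`/`valid_loss` column names are assumptions, not necessarily what the job actually emits.)

```python
import csv

def divergence_step(rows, patience=5):
    """Return the first recorded step where valid_loss has risen for
    `patience` consecutive records (a simple overfitting heuristic)."""
    rising, prev = 0, None
    for row in rows:
        v = float(row["valid_loss"])
        if prev is not None and v > prev:
            rising += 1
            if rising >= patience:
                return int(row["step"])
        else:
            rising = 0
        prev = v
    return None

# Assumed layout: a CSV with "step" and "valid_loss" columns.
with open("metrics.csv") as f:
    print(divergence_step(list(csv.DictReader(f))))
```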

I say all of this to ask whether there is a guide out there, or whether anyone could give a summary of how these hyperparameters work together and how they affect things like training-loss convergence toward 0, number of steps, depth of training, etc.

Thanks

Batch size directly determines the number of steps: steps per epoch are roughly the number of training examples divided by the batch size, so lowering the batch size raises the step count.

You should tune the other hyperparameters independently of that setting.
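A quick sketch of that arithmetic (the ceil-based step accounting is an assumption; exact bookkeeping can differ per trainer):

```python
import math

def total_steps(num_examples: int, batch_size: int, n_epochs: int) -> int:
    # Assumes steps per epoch = ceil(examples / batch size);
    # exact bookkeeping may differ slightly in a given trainer.
    return math.ceil(num_examples / batch_size) * n_epochs

# Same dataset and epochs, smaller batch size -> more steps:
print(total_steps(num_examples=9000, batch_size=3, n_epochs=3))  # 9000
print(total_steps(num_examples=9000, batch_size=2, n_epochs=3))  # 13500
```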

You also have the ability to continue training on an existing model, so you can compare the quality of one 2-epoch run vs. two consecutive 1-epoch runs, as an example of other training fun.

My spending on fine-tune experiments is done until they let you delete (or at least “archive”) the models.

Got it, so lower batch size = more steps? And yeah, I’m definitely going to go all-in on experimenting with smaller training sets.

Also, is the 2x 1-epoch thing continuing training after the fact with a lowered epoch count, or is that a ratio to batch size? Sorry, you lost me a bit there.

You can now continue a model’s training by specifying the name of the previous fine-tuned model that you want to build on.

Epochs, as a hyperparameter, is simply the number of passes through your entire training file at the given learning rate. So running the same training again on a previous fine-tuned model should be similar to adding the new job’s epoch count to the original run. This lets you deepen the weights gradually, at no more token cost than running those epochs in a single job, ultimately finding the point of overfitting (but with several undeletable intermediate models along the way).
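Roughly, with the OpenAI Python SDK it looks like this (the file ID and model name below are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Continue training by passing a previously fine-tuned model as the
# base model. The file ID and model name below are placeholders.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="ft:gpt-3.5-turbo-0125:my-org::example123",
    hyperparameters={"n_epochs": 1},  # one more pass on top of the previous run
)
print(job.id, job.status)
```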


So essentially the best strategy would be to start with lower hyperparameters (risking under-fit) and then inch forward with continued training up to the point of convergence between validation and training loss.
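As a sketch of that strategy (the file ID and base model are placeholders, and `eval_valid_loss` is a hypothetical helper standing in for however you score a model on your held-out set):

```python
import time

from openai import OpenAI

client = OpenAI()

def wait_for(job_id: str) -> str:
    """Poll until the job finishes; return the fine-tuned model name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"job ended with status {job.status}")
        time.sleep(60)

def eval_valid_loss(model_name: str) -> float:
    # Hypothetical helper: score model_name on your held-out set
    # and return the mean loss. Fill in with your own evaluation.
    raise NotImplementedError

model = "gpt-3.5-turbo-0125"  # starting base model (placeholder)
best = float("inf")
for _ in range(5):  # inch forward one epoch at a time
    job = client.fine_tuning.jobs.create(
        training_file="file-abc123",  # placeholder file ID
        model=model,
        hyperparameters={"n_epochs": 1},
    )
    model = wait_for(job.id)
    loss = eval_valid_loss(model)
    if loss >= best:  # validation loss turned upward: likely overfitting
        break
    best = loss
print("stopping with", model)
```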