I had previously fine-tuned a few models using the legacy endpoint (davinci completion engine). I’d done a pretty extensive ablation study to find suitable hyperparameters for my use-case of “conditional generation”. Now I need to re-fine-tune on the new endpoint with davinci-002, as the legacy stuff will be deprecated in January.
I had hoped to simply re-fine-tune, updating only the engine, and be done with it. I’d been told that all “best practices” from before would still apply with the new endpoints, but that does not seem to be the case. Previously, a learning_rate_multiplier value of 0.16 was great, but now that value is way too small, and my training data format is not even close to being learned; the training_loss values show the same thing.
Using the “auto” setting of hyperparameters puts learning_rate_multiplier at 2.0. For my use-case, this is much too large and I can see some clear overfitting when testing. Without doing another full ablation study, how can I think of a good hyperparameter setting for learning_rate_multiplier?
Note: I’m focusing on learning_rate_multiplier here because it is unclear what its scale should really be – both the batch_size and n_epochs parameters have clear interpretations which I don’t think I should need to mess with.
My training set has 1,700 examples (577,00 total tokens), and when using the legacy davinci fine-tuning, I had great success with n_epochs=2, batch_size=8, and learning_rate_multiplier=0.16.
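For reference, the legacy settings above can be written out as a job request. This is only a sketch: it assumes the `hyperparameters` field accepted by `fine_tuning.jobs.create` in the openai Python client, uses a placeholder file ID, and assumes a trailing partial batch still counts as one weight update (the service may handle that differently).

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, n_epochs: int) -> int:
    """Approximate number of weight updates.

    Assumption: a final partial batch still counts as one update.
    """
    return math.ceil(n_examples / batch_size) * n_epochs


def job_request(training_file_id: str, model: str, *, n_epochs: int,
                batch_size: int, learning_rate_multiplier: float) -> dict:
    """Build the keyword arguments for a fine-tuning job without sending anything."""
    return {
        "training_file": training_file_id,
        "model": model,
        "hyperparameters": {
            "n_epochs": n_epochs,
            "batch_size": batch_size,
            "learning_rate_multiplier": learning_rate_multiplier,
        },
    }


# Settings that worked on legacy davinci, per the post above.
legacy_settings = dict(n_epochs=2, batch_size=8, learning_rate_multiplier=0.16)

# "file-PLACEHOLDER" is a hypothetical ID; use the ID of your uploaded file.
request = job_request("file-PLACEHOLDER", "davinci-002", **legacy_settings)
steps = optimizer_steps(1700, legacy_settings["batch_size"], legacy_settings["n_epochs"])
print(steps)  # 426 updates for 1,700 examples, batch size 8, 2 epochs

# To actually submit (needs the openai package and an API key):
# from openai import OpenAI
# OpenAI().fine_tuning.jobs.create(**request)
```

With only a few hundred updates in total, the multiplier decides how far the weights move per update, which is why its scale matters so much here.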
They won’t say it, but the biggest change is that you are likely fine-tuning a base model 1/8th the size, either in parameters, or the quality of those parameters.
(It is noteworthy that the additional hyperparameters have returned to the API. OpenAI had set the untouchable learning rate quite high before, when only epochs was exposed. Epochs is easy to interpret: another pass over your training data.)
I think part of the success in gpt-3.5-turbo is pretraining on corpus data until well beyond where the perplexity needle is moving. That, along with massive RLHF to make it chat.
That means you are adjusting the weights of a different kind of model than the 175B davinci.
I quoted you there because continuing a fine-tune is not off the table: specifying the existing fine-tuned model instead of its base essentially adds more epochs. If you can see from the first run, and imagine from your learning curve, that your low multiplier got you halfway there (or more), you can slightly adjust the learning rate and invest your tokens in continuation instead of starting anew.
A good strategy, if you are somewhere in the middle but perhaps need more than a doubling, is to run one epoch at a significantly higher multiplier, then one finishing pass back at a lower rate near where you started. The model from the last pass doesn’t have to be the one you use if it turns out overfitted and monotonous.
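The two-pass idea could be planned out like this. It is a sketch under assumptions, not a recipe: the `Stage` class, the `boost` factor of 4, and the placeholder name for the first pass's output model are all made up for illustration, and the second job can only be created once the first has finished and returned its model name.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder: the real name is only known after stage 1 finishes.
STAGE1_OUTPUT = "<fine-tuned model returned by stage 1>"


@dataclass
class Stage:
    model: str                        # base model, or a fine-tuned model to continue from
    n_epochs: int
    learning_rate_multiplier: float


def two_stage_plan(base_model: str, low_multiplier: float, boost: float) -> List[Stage]:
    """One epoch at a boosted multiplier, then one finishing epoch near the low one."""
    if low_multiplier <= 0 or boost <= 1.0:
        raise ValueError("need low_multiplier > 0 and boost > 1")
    return [
        Stage(base_model, 1, low_multiplier * boost),   # aggressive first pass
        Stage(STAGE1_OUTPUT, 1, low_multiplier),        # gentle finishing pass
    ]


# 0.16 is the multiplier that was too small on its own; boost=4.0 is an assumption.
plan = two_stage_plan("davinci-002", low_multiplier=0.16, boost=4.0)
for stage in plan:
    print(stage)
```

Keeping both resulting models around makes it possible to fall back to the first-pass model if the finishing pass turns out monotonous.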
Throwing in an edge-of-domain validation file will let you chart the actual overfitting that you would otherwise be testing for yourself later. A validation file is typically of identical quality to the training examples, shuffleable with or extracted from your training set, which tells you much less about that.
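Once you have training and validation losses per checkpoint, the overfitting point can be read off mechanically. A small sketch, working on plain lists so it makes no claim about how the API reports metrics; the loss values are invented purely to show the shape of the check.

```python
from typing import Sequence


def best_checkpoint(valid_loss: Sequence[float]) -> int:
    """Index of the checkpoint with the lowest validation loss."""
    if not valid_loss:
        raise ValueError("valid_loss is empty")
    return min(range(len(valid_loss)), key=lambda i: valid_loss[i])


def shows_overfitting(train_loss: Sequence[float], valid_loss: Sequence[float]) -> bool:
    """True if, after the best validation checkpoint, training loss kept
    falling while validation loss ended up higher."""
    if len(train_loss) != len(valid_loss):
        raise ValueError("loss curves must have the same length")
    best = best_checkpoint(valid_loss)
    if best == len(valid_loss) - 1:
        return False  # validation loss was still improving at the end
    return train_loss[-1] < train_loss[best] and valid_loss[-1] > valid_loss[best]


# Invented numbers, for illustration only.
train = [2.1, 1.4, 0.9, 0.5, 0.3]
valid = [2.2, 1.6, 1.3, 1.4, 1.7]
print(best_checkpoint(valid), shows_overfitting(train, valid))  # 2 True
```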
Hm, good point about the model size being different, and that having an effect on learning rate.
I’ve done some work with running one epoch at a time, and adjusting with a decaying learning rate. It’s just hard to understand the kind of scale to use as a starting point, besides taking the auto setting for granted and trusting it.
Edge-of-domain is a good idea, though I think I’ll also need to put some more variance into my training set. I’m really trying to walk the line between two pairs of competing goals:
train enough to get the output format correct 100% of the time; vs.
don’t overfit, so the output content can still vary
include training examples with consistent formatting; vs.
allow slightly different test-time use (e.g. all training examples are “here are 3 ideas about X”, but I want the option to query for “give me 10 ideas about X”)
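The "correct format 100% of the time" goal can be measured rather than eyeballed. A minimal sketch, assuming (hypothetically, since the real training format isn't shown here) that each output is a numbered list such as "1. ..." or "1) ...", one item per line:

```python
import re
from typing import List, Sequence

# Assumed format: each item starts a line with its number, then "." or ")".
_ITEM = re.compile(r"^[ \t]*(\d+)[.)][ \t]+\S", re.MULTILINE)


def item_numbers(text: str) -> List[int]:
    """Numbers of the list items found, in order of appearance."""
    return [int(m.group(1)) for m in _ITEM.finditer(text)]


def format_ok(text: str, expected_items: int) -> bool:
    """True if the items are numbered exactly 1..expected_items, in order."""
    return item_numbers(text) == list(range(1, expected_items + 1))


def compliance_rate(outputs: Sequence[str], expected_items: int) -> float:
    """Fraction of outputs that follow the expected numbered-list format."""
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(format_ok(o, expected_items) for o in outputs) / len(outputs)


# Made-up sample completions: two follow the format, one stops early.
samples = ["1. a\n2. b\n3. c", "1. a\n2. b", "1) x\n2) y\n3) z"]
print(compliance_rate(samples, 3))
```

Running the same check with `expected_items=10` on "give me 10 ideas" prompts would show whether a candidate model generalizes past the count seen in training.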
Ideal hyperparameters will change from model to model, even when your input data is the same. In general, the defaults we set are values that our research team has validated to work well in many use cases. There are of course outliers, but the general approach to finding the new ideal parameters is to train the model iteratively and see how different values affect the loss.
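One way to keep that iteration cheap is to treat the two known values from this thread as a bracket (0.16 underfits, 2.0 overfits) and search between them on a log scale, since a multiplier scales the step size. This is a sketch of a generic bracketing search, not anything the API provides; all function names are made up.

```python
import math
from typing import List, Tuple


def geometric_midpoint(low: float, high: float) -> float:
    """Midpoint of the bracket on a log scale."""
    return math.sqrt(low * high)


def log_spaced(low: float, high: float, n: int) -> List[float]:
    """n candidates from low to high with a constant ratio between neighbours."""
    if n < 2 or not (0 < low < high):
        raise ValueError("need n >= 2 and 0 < low < high")
    ratio = (high / low) ** (1 / (n - 1))
    return [low * ratio ** i for i in range(n)]


def narrow(low: float, high: float, probe: float, probe_overfit: bool) -> Tuple[float, float]:
    """After training at `probe`, keep the half of the bracket that still holds the target."""
    return (low, probe) if probe_overfit else (probe, high)


low, high = 0.16, 2.0                   # underfit / overfit values from the thread
probe = geometric_midpoint(low, high)
print(round(probe, 3))                  # 0.566
print(log_spaced(low, high, 4))         # a coarse four-point sweep, if budget allows
print(narrow(low, high, probe, probe_overfit=True))
```

Each probe is one training run, so two or three rounds of narrowing cost far less than a full ablation grid.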