So I’ve been experimenting with fine-tuning, and I’m particularly interested in how to fine-tune effectively on a relatively small dataset (fewer than 100 prompt-completion pairs).
Is there a general rule of thumb for the number of epochs when fine-tuning GPT-3? Do fewer examples mean I should run more or fewer epochs?
Likewise, I notice that once a certain number of epochs is exceeded, the model just memorizes the answers verbatim. So one should lean toward fewer epochs, right?
The default is 4 epochs. Should you deviate from this when you have less training data?
First of all, the bigger the model, the better it’ll perform with a small number of examples. The best way to improve performance is to spend time creating a few more examples, rather than optimizing hyperparameters.
The number of epochs just means how many times the model sees each example: the higher the number, the “better” the memorization. For generative use cases, 2 epochs is generally better, as it reduces memorization and improves generalization. However, if you have very few examples, you often can’t get away without increasing the number of epochs, so that the model performs at least a reasonable number of weight updates.
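To make the point about weight updates concrete, here is a rough sketch of the arithmetic. It assumes a plain epoch-based training loop with one weight update per batch; the function name and the batch size of 4 are only illustrative assumptions, not values taken from the fine-tuning API.

```python
import math

def optimizer_steps(n_examples: int, n_epochs: int, batch_size: int) -> int:
    """Approximate number of weight updates, assuming one update per batch."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs

# Hypothetical setup: 80 prompt-completion pairs, batch size 4 (assumed).
for epochs in (2, 3, 4):
    print(epochs, "epochs ->", optimizer_steps(80, epochs, batch_size=4), "updates")
```

With so few examples, each extra epoch adds only a handful of updates, which is why dropping to 2 epochs can leave the model undertrained on a tiny dataset.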
So I guess my suggestion is to try davinci fine-tuning with 3 epochs or so. If you reduce the number of epochs, the model largely won’t learn very well, and if you increase it too much, it’ll very quickly memorize all the examples.
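If you want to compare a few epoch settings, one crude way to spot memorization is to generate completions for prompts that were not in the training set and count how many come back as verbatim copies of training completions. The helper below is a hypothetical sketch of that check (the name `verbatim_rate` and the exact-match criterion are my assumptions); it only does the comparison and leaves the actual model calls to you.

```python
from typing import Iterable

def verbatim_rate(generated: Iterable[str], training_completions: Iterable[str]) -> float:
    """Fraction of generated outputs that exactly match a training completion."""
    # Compare on whitespace-stripped text so trailing spaces/newlines don't hide a copy.
    seen = {c.strip() for c in training_completions}
    outputs = [g.strip() for g in generated]
    if not outputs:
        return 0.0
    copies = sum(1 for g in outputs if g in seen)
    return copies / len(outputs)

# Toy data for illustration only, not real model output.
train = ["The capital is Paris.", "The capital is Rome."]
held_out_outputs = ["The capital is Paris. ", "The capital is Madrid."]
print(verbatim_rate(held_out_outputs, train))
```

A rate that climbs as you add epochs would suggest the model is reciting training answers rather than generalizing. Exact matching is a blunt measure, so near-copies will slip through.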