Hello, I want to test the new fine-tuning feature for gpt-3.5-turbo. The documentation recommends leaving the epoch number as-is so that a default can be picked based on dataset size. I just want to know: how is this number calculated from the dataset, so I can simulate how much the fine-tuning will cost me?
Can this help you?
The number of epochs is the number of passes through the training data.
The goal is ultimately to minimize a loss function, so that given a known prompt:response pair from the training data, the model predicts a response close to the known response.
This is an iterative process.
The first time through, the guesses are pretty poor. After 3–4 times through (depending on the size and quality of your training data) the guesses are generally pretty good. And after, say, 100 epochs the guesses might be perfect (or very near), but this is likely a very bad thing: there is a high probability of over-fitting, and when you test the fine-tuned model on a hold-out or validation set you will likely get worse results than you did after 3 or 4 epochs.
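To make the over-fitting point concrete, here's a toy sketch with entirely made-up loss numbers (illustrative only, not from any real run): training loss keeps dropping the longer you train, but validation loss bottoms out after a few epochs and then climbs.

```python
# Made-up numbers for illustration: loss after N epochs of fine-tuning.
# Training loss falls monotonically; validation loss bottoms out early.
train_loss = {1: 2.10, 3: 0.80, 10: 0.30, 100: 0.01}
val_loss   = {1: 2.20, 3: 0.95, 10: 1.10, 100: 2.50}

# The epoch count you actually want is the one that minimizes *validation*
# loss, not training loss -- here that's a handful of epochs, not 100.
best_epochs = min(val_loss, key=val_loss.get)
print(best_epochs)  # → 3
```

That's the whole argument for stopping early: by epoch 100 the model has memorized the training set (loss ≈ 0) while getting worse on data it hasn't seen.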
So, what is the right number of epochs? No one knows. ¯\_(ツ)_/¯
It depends on the quality and quantity of your data, how complicated the thing you're trying to accomplish is, how good the model is at doing the thing without any fine-tuning, and so on.
The general guidance from OpenAI is in the documentation under Iterating on hyperparameters.
hyperparameters - object
The hyperparameters used for the fine-tuning job. See the fine-tuning guide for more details.
n_epochs - string or integer
The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset. “Auto” decides the optimal number of epochs based on the size of the dataset. If setting the number manually, we support any number between 1 and 50 epochs.
You’d have to run fine-tune again on a model and endpoint that support continued training to go over 50.
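As for simulating cost: since you're billed on training tokens multiplied by epochs, you can estimate it with a few lines. This is a rough sketch, not an official calculator — `PRICE_PER_1K_TOKENS` is an assumed rate, so check the current pricing page before relying on the numbers.

```python
# Rough fine-tuning cost simulator (sketch, not official).
# Assumptions:
#  - PRICE_PER_1K_TOKENS is an assumed gpt-3.5-turbo training rate in USD;
#    verify against the current OpenAI pricing page.
#  - Billed training tokens ≈ tokens in your file × number of epochs.
PRICE_PER_1K_TOKENS = 0.008  # USD per 1K training tokens (assumed)

def estimate_finetune_cost(tokens_in_file: int, n_epochs: int,
                           price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Estimated training cost in USD for a fine-tuning job."""
    return tokens_in_file * n_epochs * price_per_1k / 1000

# e.g. a 100k-token dataset trained for 3 epochs:
print(f"${estimate_finetune_cost(100_000, 3):.2f}")  # → $2.40
```

The one thing this can't tell you is what "auto" will pick for `n_epochs` — that algorithm isn't published — so simulate a range (say 1–4 epochs) to bracket your cost.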
For some deeper documentation on reinforcement learning (whether human-tagged or machine-generated), here's a doc with some discussion, including unexposed parameters; note, though, that it doesn't reveal the exact current "auto" algorithm for the learning-rate hyperparameters, which considers model size, training-set size, etc.
My dude… Chill.
I was illustrating a point about overtraining and the fact that there's some unknowable (idek) number of epochs.
No one is doing 100 epochs (or even 50).