I am fine-tuning the GPT-3.5 Turbo model. When I fine-tune, the default batch size appears to be 1, judging from the loss plot. My training set only contains a few hundred examples. I am curious why the batch size is set to 1. Is there a specific reason for this? I tried increasing the batch size, but performance seems to get worse with larger batch sizes. Why does this happen?
The batching of data is no longer under your control when fine-tuning. The only hyperparameter you can specify is the number of epochs, and if you don't set it yourself, it is chosen automatically based on the amount of training data. For the actual meaning of "batch", we can look at the docstring from source code you can no longer use:
batch_size: Optional[int]
"""The batch size to use for training.
The batch size is the number of training examples used to train a single forward
and backward pass.
By default, the batch size will be dynamically configured to be ~0.2% of the
number of examples in the training set, capped at 256 - in general, we've found
that larger batch sizes tend to work better for larger datasets.
"""
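For a few hundred training examples, that heuristic lands right at a batch size of 1, which would explain what you are seeing. Here is a rough reconstruction of the default (my own inference from the docstring above, not OpenAI's actual code):

# Rough reconstruction of the documented default batch size heuristic
# (an inference from the docstring, not official code).
def default_batch_size(n_examples: int) -> int:
    # ~0.2% of the training set, at least 1, capped at 256
    return min(256, max(1, round(0.002 * n_examples)))

print(default_batch_size(500))      # 1   -> a few hundred examples gives batch size 1
print(default_batch_size(200_000))  # 256 -> the cap kicks in for very large sets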
I think you are referring instead to "steps", the metric that is now displayed. The only relationship these seem to have is the number of times a validation check is run against the validation file and inputs to produce statistics for you. That count can run into the thousands for tens of thousands of examples × epochs; for smaller jobs it does appear to be one step per example per epoch.
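As a rough sanity check on how steps, examples, epochs, and batch size might relate (my own inference from the loss plots, not documented behavior):

import math

def expected_steps(n_examples: int, n_epochs: int, batch_size: int) -> int:
    # One optimization step per batch, per epoch (an assumption, not documented behavior)
    return math.ceil(n_examples / batch_size) * n_epochs

print(expected_steps(500, 3, 1))   # 1500 steps: one per example per epoch
print(expected_steps(500, 3, 10))  # 150 steps: fewer steps with a larger batch

With a few hundred examples and a batch size of 1, the step count would simply mirror examples × epochs.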
I appreciate your view on batch size. However, in my experience, I was able to control it with this API call:
hyperparameters = {"n_epochs": 5, "learning_rate_multiplier": 0.0001, "batch_size": 10}
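For reference, the full call looked roughly like this (a sketch from memory using the openai Python client; the training file ID is just a placeholder):

from openai import OpenAI

client = OpenAI()

# "file-abc123" is a placeholder for my uploaded JSONL training file
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    hyperparameters=hyperparameters,  # the dict shown above
)
print(job.id)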
In the loss plot, it looks like the number of steps becomes much smaller with the larger batch size.
Could you clarify why you think batch size isn’t controllable? Thank you very much.
If you are fine-tuning models that won't be turned off in January 2024 (i.e., not the legacy ones), using the replacement "fine_tuning" endpoint (not the legacy endpoint with "fine-tunes" in the URL - I know, confusing), your learning hyperparameters are limited to just epochs.
All other learning parameters are auto-tuned based on the training file, and even epochs is chosen automatically if not specified.
We see 1500 steps in some huge jobs, which is quite different from the prior maximum batch count. I have not trained jobs with 10,000+ examples, so I can't pin down the now-undocumented batch/step behavior at that scale. Under 1,000 (examples × epochs), you seemingly get one step per example.
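If you want to verify what was actually applied to a given job, you can retrieve it after creation; the job object reports the hyperparameters the endpoint settled on (a minimal sketch with a placeholder job ID, assuming the current openai Python client):

from openai import OpenAI

client = OpenAI()

# "ftjob-abc123" is a placeholder for your fine-tuning job ID
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")

print(job.status)           # e.g. "succeeded"
print(job.hyperparameters)  # the epochs (and anything auto-selected) the job actually used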