I am working on a use case where I need to do vision fine-tuning of gpt-4o, but the size of the JSONL file I'm constructing for the fine-tuning exceeds the 8 GB limit mentioned in the Uploads API documentation.
Apart from doing preprocessing (which might degrade image quality), is there any way to upload the file with the data as it is right now? Maybe in separate parts of 8 GB each? But then, is it even possible to point a fine-tuning job at several uploaded files?
The thing is, the initial training JSONL file I used was 3.3 GB, so that was fine. But after data augmentation I ended up with something like 70 GB (a lot, I know).
I don’t think so, but maybe you can use something like a decider upfront - perhaps based on a word cloud / keyword density or other search functionality - and redirect to differently fine-tuned models?
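Something along these lines, as a rough sketch - the model names and keyword lists below are just placeholders, not a tested setup:

```python
# Sketch of a keyword-density "decider" that routes a request to one of
# several fine-tuned models. Model names and keyword sets are placeholders.
from openai import OpenAI

client = OpenAI()

ROUTES = {
    "ft:gpt-4o-2024-08-06:my-org:invoices:xxxx": {"invoice", "receipt", "total", "vat"},
    "ft:gpt-4o-2024-08-06:my-org:charts:yyyy": {"chart", "axis", "plot", "trend"},
}
DEFAULT_MODEL = "gpt-4o"

def pick_model(prompt: str) -> str:
    """Return the model whose keyword set overlaps the prompt the most."""
    words = set(prompt.lower().split())
    best_model, best_hits = DEFAULT_MODEL, 0
    for model, keywords in ROUTES.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_model, best_hits = model, hits
    return best_model

prompt = "Extract the total and VAT from this invoice."
response = client.chat.completions.create(
    model=pick_model(prompt),
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```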
@sps Thanks, nice idea! And if I understood correctly, I would need to put my training images in publicly accessible storage, for example a public S3 bucket or something like that?
You can continue training using an existing fine-tuned model.
Just specify the existing “ft:” model that you created previously.
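For example, with the Python SDK (the file ID and “ft:” model name below are placeholders):

```python
# Sketch: continue training from an existing fine-tuned model by passing
# its "ft:" name as the base model. IDs are placeholders.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-abc123",                  # next chunk of training data
    model="ft:gpt-4o-2024-08-06:my-org::abc123",  # model produced by the previous job
)
print(job.id, job.status)
```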
As the learning parameters are adapted to the size of the file, and continuations can tend to “overwrite” earlier training, you will want to set manual hyperparameters, such as a low n_epochs or a learning rate multiplier lower than the default 1.0. Then provide equal-size training files to ensure uniformity. This should give you a model that avoids overfitting and doesn’t weight one training file more strongly than the others.
Since the “n_epochs” parameter literally means “passes through the training file”, you may even want to set it to 1 and alternate between training files as you deepen the weights.
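A rough sketch of that pattern, assuming the chunk files are already uploaded and using the hyperparameters field of the fine-tuning jobs API (all IDs and the starting model are placeholders):

```python
# Sketch: alternate single-epoch passes over several equal-size training
# files, chaining each resulting "ft:" model into the next job.
import time
from openai import OpenAI

client = OpenAI()

training_files = ["file-part1", "file-part2", "file-part3"]  # already-uploaded chunks
model = "ft:gpt-4o-2024-08-06:my-org::abc123"                # starting fine-tune

for file_id in training_files:
    job = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model,
        hyperparameters={
            "n_epochs": 1,                   # one pass per file
            "learning_rate_multiplier": 0.5  # gentler than the default 1.0
        },
    )
    # Poll until the job finishes, then continue from the new model.
    while True:
        job = client.fine_tuning.jobs.retrieve(job.id)
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(60)
    if job.status != "succeeded":
        raise RuntimeError(f"Job {job.id} ended with status {job.status}")
    model = job.fine_tuned_model

print("Final model:", model)
```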