Fine tuning on multiple data files

I have a few training files I’ve created, and I’m currently fine-tuning curie on my first training data file.

Just wondering … how can I further fine-tune the existing model I’ve come up with?

Would I call the same training command, but reuse some kind of unique model_id returned by the API after my first successful fine-tuning run?

```
openai api fine_tunes.create -t '*******' \
  -v '**********' \
  -m 'curie'
```

thanks!

You’ll want to concatenate your JSONL files and train a new model AFAIK.


Do you know how to do this in bash?

No idea. Why are you constrained to bash? I usually do this stuff with a Python script.

Not necessarily constrained to bash. I guess I’m just asking: are you passing multiple -t flags, one per file? Can the API accept multiple files that way?

No, I mean, concatenate your files ahead of time, sorry for the confusion.

It’s ok. My files would be too big and exceed the limits, which is why I split them: I’d hit the 3 million token limit per file, and the size would easily go over 80 MB / 100 MB. So how does the concatenation work? Like a zip file?

No, concatenation just means joining the two lists together; the result is still a single JSONL file. If you have that much data, you probably don’t need to use all of it. Remember that fine-tuning benefits from a variety of tasks as well as a lot of data, so it’s better to mix your training sets even if the total size gets capped.

I make sure each task gets equal representation in mine. Say I have 5 different tasks: I build randomized training sets where each task type makes up 20% of the total, regardless of how big the set ends up.
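Since I do this in a Python script anyway, here’s a rough sketch of what I mean. The file names, the number of tasks, and the per-task cap are just placeholders for however your own data is organized:

```python
import json
import random

# One JSONL file per task type (placeholder names).
task_files = ["summarize.jsonl", "classify.jsonl", "extract.jsonl"]

# Cap each task at the same number of examples so every task
# gets an equal share of the final training set.
per_task_cap = 2000  # tune this to stay under the token / file-size limits

combined = []
for path in task_files:
    with open(path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    random.shuffle(examples)
    combined.extend(examples[:per_task_cap])  # equal slice from each task

random.shuffle(combined)  # mix the task types together

# "Concatenating" is just writing everything back out as one JSONL file.
with open("combined_train.jsonl", "w") as out:
    for example in combined:
        out.write(json.dumps(example) + "\n")
```

Then you point -t at the combined file in the same fine_tunes.create command as before.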


I see what you mean, this is an important distinction. I was getting 99% accuracy on my data even with a single file, lol, but I think it overfit to one kind of data because it had difficulty with slightly different examples. I’ll keep this in mind for my next training run.