I couldn’t find any explanation in the fine-tuning API documentation of how the submitted data is sampled to construct each batch. Does it randomly draw a number of examples equal to the batch size at each step?
Before I submit a dataset for fine-tuning (which would normally need to be shuffled), do I need to shuffle the original dataset myself first?
Can you help me with this? I think @staff is the only one who can clarify this.
You get batch counts that don’t line up neatly with the number of epochs, like a 6-epoch job reporting 131 batches. So the batches wouldn’t necessarily be randomly sampled, as there has to be cross-batch consistency so that each example gets the same reinforcement.
Purely random sampling: one in a billion jobs would end up trained on a single repeated example. Not good.
An interesting question, but not one whose answer can change your methods. Experiment with shuffling, and iterate by continuing training on 1-epoch models?
I think your reply and my question are not on the same page.
Let me provide an example scenario. Assume my dataset has 1000 examples and I want to fine-tune the davinci-002 model for 3 epochs with batch_size 100. Then there should be 10 batches per epoch, and each batch is used to update the parameters at each step.
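To make the arithmetic concrete, here is a minimal sketch using the hypothetical numbers above (dataset size, batch size, and epoch count are the post’s example values, not anything the API documents):

```python
# Hypothetical values from the example scenario above.
dataset_size = 1000
batch_size = 100
epochs = 3

# Standard mini-batch accounting: batches per pass over the data,
# and total parameter-update steps across all epochs.
batches_per_epoch = dataset_size // batch_size  # 10 batches per epoch
total_steps = batches_per_epoch * epochs        # 30 update steps in total
```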
My question is how the API samples 100 examples from the dataset to construct a single batch for each update step (i.e., random vs. sequential; e.g., in PyTorch, the `shuffle` option of `DataLoader`).
We’re on the same side of the impenetrable knowledge wall, where they aren’t going to disclose today’s machine learning techniques.
All the hyperparameters except epochs are also going away.
What you could do is fine-tune on a modest set for a single run of davinci-002, where you can receive logprobs and echo back the logprobs of the input.
Create a training set that has two distinct patterns, something that would appear in no corpus, with 10 input sequences each. Then:
- train model 1 and model 2 the same;
- train model 3 with the examples reversed.
Then, by doing statistical analysis to check whether you have produced three identical models, you can see whether there is any difference reflecting a shuffle or a training-order effect (or anything you can affect or need to concern yourself with).
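Preparing the three training files for that experiment could look like the sketch below (file names and the two nonsense patterns are made up for illustration; the JSONL prompt/completion layout is the legacy fine-tuning format davinci-002 uses):

```python
import json

# Two distinct, out-of-corpus patterns, 10 examples each (hypothetical).
pattern_a = [{"prompt": f"zqx-{i} ->", "completion": " alpha"} for i in range(10)]
pattern_b = [{"prompt": f"vbk-{i} ->", "completion": " omega"} for i in range(10)]

def write_jsonl(path, rows):
    """Write one JSON object per line, the fine-tuning file format."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Models 1 and 2: identical file, identical order.
write_jsonl("train_model1.jsonl", pattern_a + pattern_b)
write_jsonl("train_model2.jsonl", pattern_a + pattern_b)
# Model 3: the same 20 examples, but in reversed pattern order.
write_jsonl("train_model3.jsonl", pattern_b + pattern_a)
```

If the resulting models 1 and 2 differ from each other as much as they differ from model 3, training order (or an internal shuffle) is not something you can meaningfully control.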