Does the fine-tuning API automatically shuffle the dataset?

I couldn’t find any explanation in the fine-tuning API documentation of how the uploaded data is sampled to create a batch. Does the API randomly sample a number of examples equal to the batch size at each step?

Before I submit a dataset for fine tuning (which needs to be shuffled for normal fine tuning), do I need to initially shuffle the original dataset?

Can you help me with this? I think @staff is the only one who can clarify this.

The batch counts you see reported are very different from the number of epochs, like a 6-epoch job with 131 batches. So batches wouldn’t necessarily be randomly sampled: there has to be cross-batch consistency so that each example gets the same reinforcement.

With purely random sampling, one in a billion jobs would end up trained on a single example. Not good.

An interesting question, but not one where the answer can change your methods. Maybe experiment with shuffling yourself, and iterate by continuing training on 1-epoch models?

I think your reply and my question are not on the same page.

Let me provide an example scenario. Assume that the size of my dataset is 1000 and I want to fine tune the davinci-002 model for 3 epochs with batch_size 100. Then, there should be 10 batches per epoch and each batch will be used to update the parameters for each step.

My question is how the API samples 100 examples from the dataset to construct a single batch for each update ‘step’ (i.e. random vs. sequential; e.g., in PyTorch terms, is the shuffle option of the DataLoader class True or False?).
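
To make the two possibilities in the question concrete, here is a minimal sketch in plain Python of sequential versus shuffled batching for the 1000-example, batch-size-100, 3-epoch scenario. It does not describe what the fine-tuning API actually does (that is undocumented); `make_batches` and its parameters are hypothetical names used only for illustration.

```python
import random

def make_batches(n_examples, batch_size, epochs, shuffle, seed=0):
    """Return a list of (epoch, step, indices), one entry per update step.

    Illustration only: this is NOT the fine-tuning API's known behavior.
    """
    rng = random.Random(seed)
    schedule = []
    for epoch in range(epochs):
        order = list(range(n_examples))
        if shuffle:
            # Reshuffle once per epoch, so every example is still seen
            # exactly once per epoch, just in a different order.
            rng.shuffle(order)
        for step, start in enumerate(range(0, n_examples, batch_size)):
            schedule.append((epoch, step, order[start:start + batch_size]))
    return schedule

sequential = make_batches(1000, 100, epochs=3, shuffle=False)
shuffled = make_batches(1000, 100, epochs=3, shuffle=True)

print(len(sequential))        # 10 batches per epoch * 3 epochs
print(sequential[0][2][:5])   # first batch is simply the first rows of the file
print(shuffled[0][2][:5])     # first batch is drawn from anywhere in the file
```

If the service behaves like the sequential case, pre-shuffling the file before upload matters; if it behaves like the shuffled case, pre-shuffling is harmless but unnecessary.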

We’re on the same side of the impenetrable knowledge wall, where they aren’t going to disclose today’s machine learning techniques.

All the hyperparameters except epochs are also going away.

What you could do is fine-tune on a modest set for a single run of davinci-002, where you can receive logprobs and echo logprobs.

Create a training set with two distinct patterns, 10 input sequences each, using content that would not appear in any corpus.

  • train model 1 and model 2 on the same file
  • train model 3 with the examples in reversed order.
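
One way to prepare the two files could look like the sketch below. The nonsense tokens, the `build_examples` and `to_jsonl` helpers, and the prompt/completion record layout are assumptions made for illustration; check the current fine-tuning documentation for the exact format davinci-002 expects.

```python
import json

def build_examples(n_per_pattern=10):
    """Two distinct made-up patterns, n_per_pattern examples each.

    The strings are arbitrary nonsense chosen so they are unlikely to
    appear in any pretraining corpus (an assumption, not a guarantee).
    """
    examples = []
    for i in range(n_per_pattern):
        examples.append({"prompt": f"zqxv-{i} ->", "completion": " florp"})
    for i in range(n_per_pattern):
        examples.append({"prompt": f"wkjh-{i} ->", "completion": " blent"})
    return examples

def to_jsonl(examples):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(e) for e in examples) + "\n"

forward = build_examples()
reverse = list(reversed(forward))

forward_jsonl = to_jsonl(forward)   # upload for model 1 and model 2
reversed_jsonl = to_jsonl(reverse)  # upload for model 3
```

Writing `forward_jsonl` and `reversed_jsonl` to two files gives the same 20 examples in opposite order, so any systematic difference in model 3 can only come from ordering.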

Then you can check whether there is any difference reflecting a shuffle or a training-order effect (or anything else you can affect or need to concern yourself with) by doing a statistical analysis of the logprobs to see whether you have simply made three copies of the exact same model.
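
A rough sketch of that comparison follows, assuming you have already collected per-token logprobs from each of the three models on the same fixed probe prompts (how you collect them is up to you). `mean_abs_diff` and `order_effect` are hypothetical helper names, and mean absolute difference is just one simple statistic among many you could use.

```python
from statistics import mean

def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length logprob lists."""
    if len(a) != len(b):
        raise ValueError("logprob lists must have the same length")
    return mean(abs(x - y) for x, y in zip(a, b))

def order_effect(lp_model1, lp_model2, lp_model3):
    """Compare run-to-run noise against the effect of reversing the data.

    lp_model1, lp_model2: logprobs from the two identically trained models.
    lp_model3: logprobs from the model trained on the reversed file.
    """
    noise = mean_abs_diff(lp_model1, lp_model2)
    effect = mean([
        mean_abs_diff(lp_model1, lp_model3),
        mean_abs_diff(lp_model2, lp_model3),
    ])
    return {"noise": noise, "effect": effect}
```

Reading the result is a judgment call: if both numbers are essentially zero you have made three identical models; if `effect` is clearly larger than `noise`, file order appears to influence training, which would hint that the data is not being shuffled; if the two are comparable, ordering cannot be separated from ordinary run-to-run variation.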

Hi @seokhyunan, were you able to solve this?