Order of finetuning data?

As per the post title: does the order of the rows in your finetuning data matter and/or does whatever finetune pipeline OpenAI have set up retain the order of rows during finetuning?

I ask because I have a finetuning dataset with well-defined prompt-completion tasks at various difficulty levels, and I was wondering if there is any benefit to putting the easier tasks first in the training data files, before the harder ones. The intuition is that in the early part of each epoch, before the model is particularly well-tuned, it can learn quickly on the easy tasks, pushing the loss down faster initially, so that by the time it reaches the more difficult tasks towards the end of each epoch it is already in a reasonably well-tuned state and can learn better from them. Sort of “pre-finetuning” on the easy tasks before “actually finetuning” on the difficult ones, but all rolled into a single finetune job.
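To make the idea concrete, here is a minimal sketch of how the training file could be ordered before upload. The `difficulty` field and file names are hypothetical: you would have to tag each row yourself, and the tag is stripped before upload since the API only expects the prompt/completion (or messages) keys.

```python
import json

# Hypothetical input: each line is a training example plus a custom
# "difficulty" score (1 = easy ... 3 = hard) added during data prep.
with open("tasks_with_difficulty.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Curriculum order: easiest examples first, hardest last.
rows.sort(key=lambda r: r["difficulty"])

with open("finetune_train.jsonl", "w") as f:
    for r in rows:
        r.pop("difficulty")  # keep only the fields the API expects
        f.write(json.dumps(r) + "\n")
```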

I appreciate that all the data is used in each epoch one way or another, and we don’t have any details of the batch sizes or the other training parameters, so maybe the order doesn’t make any difference. But I’m curious to know if anyone has considered or tested messing around with data ordering.


Excellent question. Hopefully someone has tested this idea and knows the exact answer.

I just have one worry with your idea. The usual recommendation for neural networks is to train on the data in random order, so that each batch contains a good variety of examples; otherwise it can cause overfitting. But if that is the case, I would be surprised if OpenAI did not automatically randomize the order anyway.

If you decide to try your idea, I suggest you make sure the dataset has good variability no matter the batch size*. Even if it starts with simple questions, it is probably a good idea to avoid putting similar questions in a row, like “historic questions” first, then “math questions”, and also to avoid using the same formulaic approach for several simple or complex questions in a row (a rough sketch of this is below).

*My question: are the batch size and the defaults for the other parameters of the new finetuning API visible somewhere? I could not find them.
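To illustrate the suggestion above, here is a rough sketch that keeps the easy-to-hard ordering but round-robins across subtopics within each difficulty bracket, so similar questions don’t end up in a row. The `difficulty` and `topic` fields are hypothetical tags you would add yourself and strip before upload.

```python
import json
import random
from collections import defaultdict
from itertools import zip_longest

random.seed(0)

# Hypothetical fields: "difficulty" (1 = easy ... 3 = hard) and "topic".
with open("tasks_with_difficulty.jsonl") as f:
    rows = [json.loads(line) for line in f]

ordered = []
for level in sorted({r["difficulty"] for r in rows}):
    # Group this bracket's examples by topic and shuffle within each topic...
    by_topic = defaultdict(list)
    for r in rows:
        if r["difficulty"] == level:
            by_topic[r["topic"]].append(r)
    for group in by_topic.values():
        random.shuffle(group)
    # ...then round-robin across topics so adjacent rows differ in subtopic.
    for batch in zip_longest(*by_topic.values()):
        ordered.extend(r for r in batch if r is not None)

with open("finetune_train.jsonl", "w") as f:
    for r in ordered:
        clean = {k: v for k, v in r.items() if k not in ("difficulty", "topic")}
        f.write(json.dumps(clean) + "\n")
```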

No, I don’t think any of the parameters are visible. The only parameter we can tweak is the number of epochs, at least for the time being.
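For reference, a minimal sketch of starting a job with the epoch count set explicitly, assuming the v1-style `openai` Python client; the training file ID and model name here are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# "file-abc123" stands in for the ID returned when you upload the ordered
# JSONL file; n_epochs is the one hyperparameter we can set ourselves.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 2},
)
print(job.id, job.status)
```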

Also, good suggestion about mixing the different subtopics within the difficulty brackets rather than grouping them.

Hi,

I am not aware of any findings that finetuning data order makes anything but a minor difference. It will not produce exactly the same results, but I think those differences will be in the noise.
