Fine Tune GPT-3.5 Results File: Meaning of "Step", Number of Rows, and Randomization

I fine tuned several GPT-3.5 models varying the size of the training set: one with size 12, another 400, and another with 4000. They were run with 8, 3, and 3 epochs, respectively. I exported the contents of the result_files and have some questions. The first few lines looked like this:


The resulting files have 96, 1200, and 1500 lines (excluding header), respectively.

What do each line represent? Do they each represent running one entry of the training set through stochastic gradient descent, where an epoch is running through all the entries of the training set? If so, then why is the file for the third fine tuning model mentioned only consist of 1500 lines instead of 12000? And when training, does GPT automatically randomize the order for each epoch?

In addition, why is there no value assigned to the column “valid_mean_token_accuracy”?

Steps likely refers to batch size progress.

There are two missing numbers after each comma. The report is not as useful if you didn’t provide validation data with your training data - a set of similar unused inputs that used to evaluate how well the model does.

Reporting could tell you when your training is reaching an optimum point - better if you were then able to continue fine-tune to make use of this information instead of starting again with the new endpoint.

Thanks, this is really helpful! I wasn’t aware that we could attach validation data as well (we’ve been doing this via a separate process). Is it easy to point us to reference on how to do this? Sorry if this is a dense question, but the docs are really terse.

The current docs are just terse, but they tore down the prior completion model documentation that was better, removed github cookbook examples, and block from showing a history of captures that may or may not be correct or useful.

One would have to look to other guides beyond what is current to see how completion models still in operation for a few more months were trained:

One could try to see what function remains with experimentation, but since they wrote that weights and measures no longer works lets you guess that they didn’t include ability for validations performance metrics during tuning (“coming maybe”).