Fine-tuning GPT-3.5 completed early with fewer than 10% of steps run

The fine-tuning feature for gpt-3.5-turbo was just announced. As I’m trying it, I cannot specify hyperparameters like batch_size, so it ran with the default batch_size = 1. I have ~9,500 examples × 2 epochs, which should run ~19,000 steps. However, it finished (shown as successfully completed) in 40 minutes with only 1606 steps run, though I’m still charged for all 19,000 examples. I have printed all the related events below. I tried with a smaller toy set and there was no such problem. Does anyone know what happened?
I checked with the validation script provided; it shows no errors, and all 9532 examples / 3M+ tokens are read properly.
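For reference, the expected step count follows from the dataset size, epoch count, and batch size. A minimal sketch of that arithmetic (the helper name is mine, not part of the OpenAI API):

```python
import math

def expected_steps(n_examples: int, n_epochs: int, batch_size: int) -> int:
    """Approximate total training steps: one step per batch, per epoch."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs

# 9532 examples, 2 epochs, default batch_size = 1
print(expected_steps(9532, 2, 1))  # 19064, close to the 19265 total reported in the event log
```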

{
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-PB",
      "created_at": 1692907486,
      "level": "info",
      "message": "Fine-tuning job successfully completed",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-wq",
      "created_at": 1692907484,
      "level": "info",
      "message": "New fine-tuned model created: ft:gpt-3.5-turbo-0613:xxxx",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-V3",
      "created_at": 1692906627,
      "level": "info",
      "message": "Step 1000/19265: training loss=0.42",
      "data": {
        "step": 1000,
        "train_loss": 0.4162065088748932,
        "train_mean_token_accuracy": 0.8743082880973816
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-ga",
      "created_at": 1692904697,
      "level": "info",
      "message": "Fine tuning job started",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-Z0",
      "created_at": 1692904694,
      "level": "warn",
      "message": "Fine tuning job failed, re-enqueued for retry",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-n8",
      "created_at": 1692904473,
      "level": "info",
      "message": "Fine tuning job started",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-IP",
      "created_at": 1692904472,
      "level": "info",
      "message": "Created fine-tune: ftjob-pf",
      "data": null,
      "type": "message"
    }
  ],
  "has_more": false
}
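Both the retry and the short wall-clock time can be spotted programmatically from the event list above. A minimal sketch, using a trimmed copy of the events (newest first, as the API returns them):

```python
import json

events_json = """
{"data": [
  {"created_at": 1692907486, "level": "info", "message": "Fine-tuning job successfully completed"},
  {"created_at": 1692906627, "level": "info", "message": "Step 1000/19265: training loss=0.42"},
  {"created_at": 1692904697, "level": "info", "message": "Fine tuning job started"},
  {"created_at": 1692904694, "level": "warn", "message": "Fine tuning job failed, re-enqueued for retry"},
  {"created_at": 1692904473, "level": "info", "message": "Fine tuning job started"},
  {"created_at": 1692904472, "level": "info", "message": "Created fine-tune: ftjob-pf"}
]}
"""
events = json.loads(events_json)["data"]

# Any warn-level event signals a failure/retry.
warnings = [e for e in events if e["level"] == "warn"]
print(len(warnings))  # 1

# Wall-clock time of the successful attempt: last "started" to "completed".
starts = [e["created_at"] for e in events if e["message"].endswith("started")]
completed = next(e["created_at"] for e in events
                 if "successfully completed" in e["message"])
minutes = (completed - max(starts)) / 60
print(round(minutes))  # 46 -- far short of the ~10 hours a full run should take
```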


Hi,

The logs have two “Fine tuning job started” entries and one warning in between that says, “Fine tuning job failed, re-enqueued for retry”. This means that the job initially failed and was retried automatically.

That would explain the partial 1606 steps processed.

Hi, thank you so much for the reply! Does this mean a “retry” can introduce bugs into the fine-tuning process? The 1606 steps come from the final, successful attempt, not the failed one, and the system somehow considered the run complete at 1606 steps. The first attempt failed after a few minutes (the timestamp is in the log) and never made it to 1000 steps.
Either way, it didn’t finish the fine-tuning process yet returned a “completed” status. This fine-tuning should take about 10 hours. I still think this is a bug that may keep occurring.
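As a rough sanity check on the billing, training cost scales with dataset tokens × epochs. A sketch, assuming the launch-time gpt-3.5-turbo training rate of $0.008 per 1K tokens (verify the current rate on OpenAI’s pricing page before relying on this):

```python
def training_cost_usd(n_tokens: int, n_epochs: int,
                      price_per_1k: float = 0.008) -> float:
    """Estimate fine-tuning training cost: billed tokens = dataset tokens x epochs."""
    return n_tokens * n_epochs / 1000 * price_per_1k

# ~3M dataset tokens, 2 epochs
print(round(training_cost_usd(3_000_000, 2), 2))  # 48.0
```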

I think all you can do at this stage is test the model, and reach out via help.openai.com (the help bot in the bottom-right corner) if you have been charged for a partially complete run.


The reply from OpenAI help:

I followed up with the fine tuning team and they explained that this was a display bug and that the job should have trained the full steps.

I’m not sure what kind of display bug they mean. In any case, they insisted that the training did indeed finish.