I’m using a fine-tuning process that has worked successfully hundreds of times over the past several years, but it now fails. The training files validate and fine-tuning begins, but the job then fails with an internal error, retries twice, and then fails completely. The dashboard event log is below (newest first), with the API flow sketched after it.
21:41:50  The job failed due to an internal error.
21:23:42  Fine-tuning job started
21:23:35  The job experienced an error while training and failed, it has been re-enqueued for retry.
21:05:19  Fine-tuning job started
21:05:13  The job experienced an error while training and failed, it has been re-enqueued for retry.
20:47:06  Fine-tuning job started
20:47:04  Files validated, moving job to queued state
20:42:23  Validating training file: file-8KrKVpPVysyZbDaVJZPAqT and validation file: file-ECw1XrDu5rKKsuGuLHqtr7
20:42:23  Created fine-tuning job: ftjob-JiuewuY4cBu9lU8Mo663ICTF
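For comparison, this is roughly the flow behind the log above, sketched with the Python SDK. The base model name is a placeholder (the post doesn’t state which model is being tuned); the file IDs are the ones from the log.

```python
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

# Placeholder base model; the training/validation file IDs match the log above.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file="file-8KrKVpPVysyZbDaVJZPAqT",
    validation_file="file-ECw1XrDu5rKKsuGuLHqtr7",
)

# Pull the same timestamped event stream the dashboard shows (newest first).
events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job.id, limit=50)
for event in events.data:
    stamp = datetime.fromtimestamp(event.created_at, tz=timezone.utc).strftime("%H:%M:%S")
    print(stamp, event.message)
```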
I’m experiencing the same issue.
Fine-tuning jobs fail deterministically at the end-of-epoch / end-of-job boundary and are automatically “re-enqueued for retry”.
Job A: ftjob-es4MiHzZaAN93vYpJPREfzNI (n_epochs=3, fails right after Step 72/216 = end of epoch 1)
Job B: ftjob-C3sKWB8yvYD3hqu263FTmkVx (n_epochs=1, fails right after Step 70/70 = end of job)
Reproduces across datasets (including previously working ones) and across base models; a sketch for pulling the failure details on those jobs is below.
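If it helps whoever triages this, the failure details can be pulled straight from the API. A minimal sketch with the Python SDK, using the two job IDs quoted above (substitute your own if you’re checking a different account):

```python
from openai import OpenAI

client = OpenAI()

# The two job IDs from this report. For these, `error` presumably carries only
# the generic internal-error message, but `status` and `trained_tokens` show
# how far training got before the boundary failure.
for job_id in ("ftjob-es4MiHzZaAN93vYpJPREfzNI", "ftjob-C3sKWB8yvYD3hqu263FTmkVx"):
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(job_id, job.status, job.trained_tokens, job.error)
```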
I’m seeing the same thing. The job dies after epoch 1, retries, keeps dying, and eventually fails. I re-ran a job that completed fine two days ago and the re-run failed after epoch 1, which tells me this is something on the OpenAI side.
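For anyone who wants to run the same isolation test, here’s roughly how the earlier job can be re-submitted with identical settings. The job ID is a placeholder, and this uses the older `hyperparameters` argument on the assumption that the original run was a plain supervised fine-tune:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: the ID of the job that completed successfully two days ago.
old = client.fine_tuning.jobs.retrieve("ftjob-REPLACE_WITH_OLD_JOB_ID")

# Re-submit with the same base model, files, and epoch count. If this copy
# also dies at the end of epoch 1, the data and settings are not the problem.
rerun = client.fine_tuning.jobs.create(
    model=old.model,
    training_file=old.training_file,
    validation_file=old.validation_file,
    # n_epochs may come back as "auto" or an int; create() accepts either.
    hyperparameters={"n_epochs": old.hyperparameters.n_epochs},
)
print(rerun.id, rerun.status)
```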