Fine-tuning jobs failing with "internal error"

This problem has returned after temporarily working again last week (see: Internal Error during fine-tuning).

I’m using a fine-tuning process that has worked successfully hundreds of times over the past several years, but it now fails. The training files validate and fine-tuning begins, but the job then fails with an internal error, retries twice, and finally fails completely.

21:41:50 The job failed due to an internal error.

21:23:42 Fine-tuning job started

21:23:35 The job experienced an error while training and failed, it has been re-enqueued for retry.

21:05:19 Fine-tuning job started

21:05:13 The job experienced an error while training and failed, it has been re-enqueued for retry.

20:47:06 Fine-tuning job started

20:47:04 Files validated, moving job to queued state

20:42:23 Validating training file: file-8KrKVpPVysyZbDaVJZPAqT and validation file: file-ECw1XrDu5rKKsuGuLHqtr7

20:42:23 Created fine-tuning job: ftjob-JiuewuY4cBu9lU8Mo663ICTF
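As a quick sanity check on the log above, here is a minimal Python sketch that measures how long each attempt ran before it errored. The event list is copied from the log; the function name `attempt_durations` is only illustrative, and the sketch assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Events copied from the job log, newest first: (timestamp, message).
EVENTS = [
    ("21:41:50", "The job failed due to an internal error."),
    ("21:23:42", "Fine-tuning job started"),
    ("21:23:35", "The job experienced an error while training and failed, it has been re-enqueued for retry."),
    ("21:05:19", "Fine-tuning job started"),
    ("21:05:13", "The job experienced an error while training and failed, it has been re-enqueued for retry."),
    ("20:47:06", "Fine-tuning job started"),
    ("20:47:04", "Files validated, moving job to queued state"),
]


def attempt_durations(events):
    """Return seconds between each 'started' event and the event that follows it.

    Assumes the list is newest-first and all timestamps are on the same day.
    """
    durations = []
    started_at = None
    for stamp, message in reversed(events):  # walk oldest to newest
        t = datetime.strptime(stamp, "%H:%M:%S")
        if message == "Fine-tuning job started":
            started_at = t
        elif started_at is not None:
            durations.append(int((t - started_at).total_seconds()))
            started_at = None
    return durations


print(attempt_durations(EVENTS))  # [1087, 1096, 1088]
```

All three attempts ran for roughly 18 minutes before failing, which suggests the failure happens at the same point in training each time rather than at random.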

Experiencing the same issue.
Fine-tuning jobs fail deterministically at the end-of-epoch or end-of-job boundary and are automatically “re-enqueued for retry”.

  • Job A: ftjob-es4MiHzZaAN93vYpJPREfzNI (n_epochs=3, fails right after Step 72/216 = end of epoch 1)

  • Job B: ftjob-C3sKWB8yvYD3hqu263FTmkVx (n_epochs=1, fails right after Step 70/70 = end of job)

This reproduces across datasets (including ones that previously worked) and across base models.
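The step arithmetic behind "end of epoch 1" can be checked with a small Python sketch. It assumes the total step count divides evenly across epochs, which holds for both jobs listed here; the helper names are hypothetical.

```python
def steps_per_epoch(total_steps, n_epochs):
    """Steps in one epoch, assuming steps are split evenly across epochs."""
    if n_epochs <= 0 or total_steps % n_epochs != 0:
        raise ValueError("total_steps must divide evenly by n_epochs")
    return total_steps // n_epochs


def is_epoch_boundary(step, total_steps, n_epochs):
    """True when `step` is the last step of some epoch."""
    return step > 0 and step % steps_per_epoch(total_steps, n_epochs) == 0


# Job A: 216 steps over 3 epochs, failure after step 72.
print(steps_per_epoch(216, 3), is_epoch_boundary(72, 216, 3))  # 72 True
# Job B: 70 steps over 1 epoch, failure after step 70.
print(steps_per_epoch(70, 1), is_epoch_boundary(70, 70, 1))    # 70 True
```

In both jobs the failing step is exactly the last step of an epoch.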

I’m seeing the same thing. The job dies after epoch 1, retries, keeps dying, and eventually fails. I re-ran a job that ran fine two days ago, and the re-run failed after epoch 1. That tells me this is something on the OpenAI side.

Getting the same issue for days now: “The job failed due to an internal error”. Tried everything.

Things have “progressed” to jobs not getting past file validation (on files that used to work fine).

Same here – training files that validated fine in a few minutes before have been spinning for 2 hours now.

Just now retried a previous job and it validated and got past the first epoch. Looks like someone rebooted the computer.

I have had several jobs succeed now too. It looks like things are working now, but who knows for how long?

Try fine-tuning via code. That worked for me a while back.
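For anyone who wants to try that, here is a minimal sketch of what "via code" could look like with the official OpenAI Python SDK. It is not a confirmed workaround for the internal error. The function name `run_fine_tune` and the polling interval are my own choices, and the model name must be whichever base model you normally use. The function takes a client object, so it assumes you have already created one with `from openai import OpenAI; client = OpenAI()`.

```python
import time

# Job statuses after which polling should stop.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def run_fine_tune(client, training_file, validation_file, model, poll_seconds=60):
    """Create a fine-tuning job, poll until it finishes, and return recent events.

    `client` is expected to be an `openai.OpenAI()` instance.
    """
    job = client.fine_tuning.jobs.create(
        training_file=training_file,
        validation_file=validation_file,
        model=model,
    )
    # Poll until the job reaches a terminal status.
    while job.status not in TERMINAL_STATES:
        time.sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job.id)
    # Fetch the most recent event messages (same text as the dashboard log).
    events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job.id, limit=20)
    return job, [event.message for event in events.data]


# Example usage (file IDs are placeholders for your own uploaded files):
#   from openai import OpenAI
#   job, messages = run_fine_tune(OpenAI(), "file-...", "file-...", "<your base model>")
#   print(job.status, messages)
```

If the failure is on the server side, this will hit the same error as the dashboard, but it makes retrying and capturing the event log easier.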