The Job Failed Due to an Internal Error | Fine-tuning gpt4o-mini

I am facing an error while fine-tuning the gpt4o-mini model. I am able to fine-tune the gpt4o model with the same dataset, but the process fails with gpt4o-mini. The training stops consistently at step 98/1173. Please find the attached image for reference.

I am constantly facing this error: The job experienced an error while training and failed, it has been re-enqueued for retry.

Hello, I have been encountering the same problem since the 16th. Have you found a solution?

Multiple fine-tuning jobs are failing for me as well. This wasn’t the case before. Is it still happening for you as well? Have you found a solution?

I am also in the same situation.
I’d appreciate a solution for this.

Each time I fine-tune, I start from the base 4o-mini model, rather than stacking another layer of fine-tuning on top of an already fine-tuned model.

Has anyone tried this method?
Divide the training set into batches, with 100 examples in each batch.
The first batch fine-tunes model A based on 4o-mini
The second batch fine-tunes model B based on model A
The third batch fine-tunes model C based on model B

Continue until all the data has been used for fine-tuning. The result is model N.

The basis of this method is that fine-tuning with a small amount of data is mostly successful. Has anyone tried it?
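For anyone who wants to try the chained approach described above, here is a minimal sketch and not a tested recipe. It assumes `client` is an `openai.OpenAI()` instance from the official Python SDK, that `lines` holds the training set as JSONL strings, and that a fine-tuned 4o-mini model is accepted as the base for the next job. The helper names `split_into_batches` and `chain_fine_tunes`, the snapshot name `gpt-4o-mini-2024-07-18`, and the polling interval are my own choices for illustration.

```python
import time
from typing import Iterator, List


def split_into_batches(lines: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` JSONL lines."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def chain_fine_tunes(client, lines, base_model="gpt-4o-mini-2024-07-18",
                     batch_size=100, poll_seconds=30):
    """Fine-tune on each batch in turn, using the previous result as the base.

    `client` is assumed to be an `openai.OpenAI()` instance.
    Returns the name of the last fine-tuned model (model N).
    """
    model = base_model
    for batch in split_into_batches(lines, batch_size):
        # Upload this batch as its own JSONL training file.
        payload = ("\n".join(batch) + "\n").encode("utf-8")
        uploaded = client.files.create(file=("batch.jsonl", payload),
                                       purpose="fine-tune")
        job = client.fine_tuning.jobs.create(training_file=uploaded.id,
                                             model=model)
        # Poll until the job reaches a terminal state.
        while job.status not in ("succeeded", "failed", "cancelled"):
            time.sleep(poll_seconds)
            job = client.fine_tuning.jobs.retrieve(job.id)
        if job.status != "succeeded":
            raise RuntimeError(f"job {job.id} ended with status {job.status}")
        # The model produced by this batch becomes the base for the next one.
        model = job.fine_tuned_model
    return model
```

Note that the last batch can be smaller than the others, and a very small final batch may be rejected by validation, so it may be worth merging it into the previous one.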


I have the same problem. On https://status.openai.com/ it says that on December 16 there was an incident related to the fine-tuning API. It states that it was resolved, but it doesn’t seem to be true…

I am having the same issue. I ran a fine-tuning job twice. Each time it validates the training set (dataset), shows status: running for a couple of minutes, and even gives me an estimated finish time, but then I receive the error: The job failed due to an internal error.

This issue is from OpenAI itself. They don’t have many answers other than stating they faced some downtime. You can mail their support team, and they typically fix it within 2-3 days.

I encountered the exact same error as well. Interestingly, it consistently occurs at a specific iteration. I developed a few hypotheses about why it’s failing and attempted to adjust the format and content of my training data, but I haven’t been able to solve the problem. It would be great if OpenAI could step in and help us resolve this issue.

This is an issue on OpenAI's side. If it occurs consistently, you need to mail them and include the job ID.
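If it helps when writing to support, a rough sketch like the one below could gather the job ID, status, error, and recent events into one block of text. It assumes `client` is an `openai.OpenAI()` instance; the function name `summarize_failed_job` and the output layout are made up for illustration, and the fields on your job object may differ.

```python
def summarize_failed_job(client, job_id, max_events=20):
    """Collect the job ID, status, error, and recent events as plain text.

    `client` is assumed to be an `openai.OpenAI()` instance.
    """
    job = client.fine_tuning.jobs.retrieve(job_id)
    lines = [f"job id: {job.id}", f"model: {job.model}", f"status: {job.status}"]
    # A failed job is expected to carry an error object with a code and message.
    error = getattr(job, "error", None)
    if error is not None:
        lines.append(f"error: {error.code}: {error.message}")
    # Append the most recent events the API returns for this job.
    events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id,
                                                 limit=max_events)
    for event in events.data:
        lines.append(f"[{event.level}] {event.message}")
    return "\n".join(lines)
```

Calling `print(summarize_failed_job(client, "<your job id>"))` would give text that can be pasted into the email.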

Same here with gpt4o-mini and also gpt3.5turbo.

Which OpenAI email should we use for this type of error?

Thanks for the suggestion. This works. I kept running into the internal error failure while fine-tuning 4o-mini. Those jobs were run as a single batch. Setting the batch-size hyperparameter so that the data was split into multiple batches of roughly 100 examples each did the trick!
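One possible reading of this fix, sketched under assumptions: pass an explicit `batch_size` in the job's `hyperparameters` instead of leaving it on automatic. The helper `build_job_request`, the snapshot name, and the file ID shown are placeholders of mine, and the range of batch sizes the API accepts is not something this sketch checks beyond being a positive integer.

```python
def build_job_request(training_file_id, model="gpt-4o-mini-2024-07-18", batch_size=100):
    """Build keyword arguments for a fine-tuning job with an explicit batch size."""
    if not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return {
        "training_file": training_file_id,
        "model": model,
        # Assumption: the job accepts a top-level `hyperparameters` object.
        "hyperparameters": {"batch_size": batch_size},
    }


# Usage, assuming `client` is an `openai.OpenAI()` instance:
# job = client.fine_tuning.jobs.create(**build_job_request("file-abc123"))
```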

A bugfix has been deployed.
You can try running your jobs again. Please report in the topic below if you run into further issues.
