Fine-tuning process stuck on "Validating files..."

Hello,

My latest fine-tuning job has been stuck in the “Validating files…” state for an hour now.

Is anyone else having the same problem?

Thank you

Having the same problem. I tested on several accounts and got the same issue.

Yes, same problem here; the status of my uploaded files is “None”.
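
For anyone who wants to check that themselves, here is a minimal sketch of retrieving an uploaded file’s status; it assumes the current openai Python SDK, and the file ID is a placeholder:

```python
# Minimal sketch: check the processing status of an uploaded training file.
# Assumes the openai Python SDK (v1+); the file ID below is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

uploaded = client.files.retrieve("file-abc123")  # placeholder file ID
print(uploaded.id, uploaded.status)  # e.g. "uploaded", "processed", or "error"
```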

I’m not sure if this is a recent bug, but usually this happens when the files are not valid or there are errors in them. OpenAI has a chunk of code on their GitHub that lets you check whether the tuning files you used are valid or not.
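
If it helps anyone rule that out, below is a rough, simplified sketch of the kind of format check that script performs, assuming the chat-format fine-tuning JSONL (one JSON object per line containing a `messages` list). It is not OpenAI’s actual validation code:

```python
# Rough local format check for a chat-format fine-tuning JSONL file.
# Simplified stand-in for OpenAI's cookbook validation script, not the real thing.
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_jsonl(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"line {line_no}: not valid JSON")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                print(f"line {line_no}: missing or empty 'messages' list")
                continue
            for msg in messages:
                if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES or not msg.get("content"):
                    print(f"line {line_no}: message needs a valid 'role' and non-empty 'content'")

check_jsonl("training_data.jsonl")  # placeholder filename
```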

That’s definitely not a validation issue. I tried with the example JSONL from the documentation and got the same result.

Same here. I’ve had this error since this morning; my job has now been sitting in the “validating_files” status for about 6 hours.

Same here, I had the same error (a stuck job) for the past 9 hours.
Update:
After about 13 hours my job went into the queue and then moved to training after a while.

Likewise, I’ve been waiting for approximately 5 hours.

Same here. It is not a file validation issue, because I’ve tried with the same file that I used on Friday and it is not working either.

Something happened on 9/22: this new “validation” step appeared, and for whatever reason it takes a very long time.

Here is a before and after:

Before, on 9/21, there was no validation message and the job was relatively quick.

After, one day later on 9/22, there is this “validation” message and it is taking much longer.

Actually, something happened today. I have files from Friday that went through this step, and validation took about 20 seconds; today it has been running for 4 hours already, so I don’t think it is related to the 9/22 change.

Was there anything different about your file?

Not much on my end: the same basic data with ~4000 JSONL rows, with a minor formatting change of removing the prepended space to accommodate the new 100k tokenizer.

I must have caught it right as they rolled out the validation step.

So if it was working recently, then I’m not sure what could be slowing it down.

Nothing different. It just moved forward after 4 hours of validating the file, lol.

By the way, my file has only 40 samples.

Now jobs are getting stuck in the “Waiting” state :grinning:.

My job is stuck on waiting too; I had to cancel a few other jobs that were stuck on validating… I’ll wait this one out until tomorrow and see.
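
In case it’s useful to others, here is a minimal sketch of listing recent fine-tuning jobs and cancelling a stuck one, assuming the current openai Python SDK; the job ID is a placeholder:

```python
# Minimal sketch: list recent fine-tuning jobs and cancel one that is stuck.
# Assumes the openai Python SDK (v1+); the job ID below is a placeholder.
from openai import OpenAI

client = OpenAI()

# Print the status of recent jobs ("validating_files", "queued", "running", ...)
for job in client.fine_tuning.jobs.list(limit=10):
    print(job.id, job.status)

# Cancel a job that has been stuck for too long
client.fine_tuning.jobs.cancel("ftjob-abc123")  # placeholder job ID
```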

Just wondering (since this is actually the first time I’m trying fine-tuning)… how long does it take on average on a regular day?

A few weeks ago it took anywhere from 10 minutes to 1 hour for me. This was for 3 epochs on 4000 training examples (~500k trained tokens).

But the 10-minute-to-1-hour variation was for the same basic file and epochs, so I’m not sure what is causing such a large fluctuation in training time (~50k tokens/min down to ~8k tokens/min).
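
For what it’s worth, a quick back-of-the-envelope check of those throughput figures, using the ~500k trained tokens quoted above:

```python
# Back-of-the-envelope check of the throughput figures quoted above.
trained_tokens = 500_000  # ~500k trained tokens (3 epochs over ~4000 examples)

for minutes in (10, 60):
    print(f"{minutes} min -> ~{trained_tokens / minutes / 1000:.0f}k tokens/min")
# 10 min -> ~50k tokens/min
# 60 min -> ~8k tokens/min
```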

It seems like the bottleneck is somewhere in the “validation server”, not really in the training stage, based on the logs.

9/28/2023, 1:34 AM PDT (screenshot)

I guess just run your 10 questions while the world sleeps.

How many trained tokens in total, @_j?

Any patterns you have noticed?

Or is this just a one-off training day?

The fluctuation in training time is a real thing I experienced, so I’m not sure if this is a server bottleneck, a priority issue, or a file-size / token-count issue.

The more data, the more we can identify the problem (and see if OpenAI can fix it, or is at least aware of it!)

Are you saying run at “midnight” for the world, so 1 am Pacific?

That’s all of 1,000 tokens. I’m mainly demonstrating that there was no training-queue wait nor a wait for file processing; taking a forum user’s non-working training file all the way to a model was under 10 minutes, including writing scripts…

Ah I see.

I’m guessing the problem is regional then.

Maybe the fine-tuning servers in Europe are having an issue?