Fine-tuning process stuck on "Validating files..."

Hello,

My latest fine-tuning job has been stuck in the “Validating files…” state for an hour now.

Is anyone else having the same problem?

Thank you

Having the same problem. I tested on several accounts and got the same issue.

Yes, same problem here; the status of my uploaded files is “None”.
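
For anyone who wants to check that themselves, here is a minimal sketch of retrieving an uploaded file’s status; it assumes the current openai Python SDK, and the file ID is a placeholder:

```python
# Minimal sketch: check the processing status of an uploaded training file.
# Assumes the openai Python SDK (v1+); the file ID below is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

uploaded = client.files.retrieve("file-abc123")  # placeholder file ID
print(uploaded.id, uploaded.status)  # e.g. "uploaded", "processed", or "error"
```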

I’m not sure if this is a recent bug, but usually this happens when the files are not valid or there are errors in them. OpenAI has a chunk of code on their GitHub that lets you check whether the tuning files you used are valid or not.
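
If it helps anyone rule that out, below is a rough, simplified sketch of the kind of format check that script performs, assuming the chat-format fine-tuning JSONL (one JSON object per line containing a `messages` list). It is not OpenAI’s actual validation code:

```python
# Rough local format check for a chat-format fine-tuning JSONL file.
# Simplified stand-in for OpenAI's cookbook validation script, not the real thing.
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_jsonl(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"line {line_no}: not valid JSON")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                print(f"line {line_no}: missing or empty 'messages' list")
                continue
            for msg in messages:
                if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES or not msg.get("content"):
                    print(f"line {line_no}: message needs a valid 'role' and non-empty 'content'")

check_jsonl("training_data.jsonl")  # placeholder filename
```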

That’s definitely not a validation issue. I tried with the example JSONL from the documentation and got the same result.

Same here. I’ve had this error since this morning; my job has now been sitting in the “validating_files” status for about 6 hours.

Same here, I had the same error (a stuck job) for the past 9 hours.
Update:
After about 13 hours my job went into the queue and then moved to training after a while.

Likewise, I’ve been waiting for approximately 5 hours.

Same here. It is not a file validation issue, because I’ve tried with the same file that I used on Friday and it is not working either.

Something happened on 9/22: this new “validation” step appeared, and for whatever reason it takes a very long time.

Here is a before and after:

Before, on 9/21, there was no validation message and the job was relatively quick.

After, one day later on 9/22, there is this “validation” message and it is taking much longer.

Actually, something happened today. I have files from Friday that went through this step, and validation took about 20 seconds; today it has been running for 4 hours already, so I don’t think it is related to the 9/22 change.

Was there anything different about your file?

Not much on my end: the same basic data with ~4000 JSONL rows, with a minor formatting change of removing the prepended space to accommodate the new 100k tokenizer.

I must have caught it right as they rolled out the validation step.

So if it was working recently, then I’m not sure what could be slowing it down.

Nothing different. It just moved forward after 4 hours of validating the file, lol.

By the way, my file has only 40 samples.

Now jobs are getting stuck in the “Waiting” state :grinning:.

My job is stuck on waiting too; I had to cancel a few other jobs that were stuck on validating… I’ll wait this one out until tomorrow and see.
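
In case it’s useful to others, here is a minimal sketch of listing recent fine-tuning jobs and cancelling a stuck one, assuming the current openai Python SDK; the job ID is a placeholder:

```python
# Minimal sketch: list recent fine-tuning jobs and cancel one that is stuck.
# Assumes the openai Python SDK (v1+); the job ID below is a placeholder.
from openai import OpenAI

client = OpenAI()

# Print the status of recent jobs ("validating_files", "queued", "running", ...)
for job in client.fine_tuning.jobs.list(limit=10):
    print(job.id, job.status)

# Cancel a job that has been stuck for too long
client.fine_tuning.jobs.cancel("ftjob-abc123")  # placeholder job ID
```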

Just wondering (since this is actually the first time I’m trying fine-tuning)… how long does it take on average on a regular day?

A few weeks ago it took anywhere from 10 minutes to 1 hour for me. This was for 3 epochs on 4000 training examples (~500k trained tokens).

But the 10-minute-to-1-hour variation was for the same basic file and epochs, so I’m not sure what is causing such a large fluctuation in training time (~50k tokens/min down to ~8k tokens/min).
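
For what it’s worth, a quick back-of-the-envelope check of those throughput figures, using the ~500k trained tokens quoted above:

```python
# Back-of-the-envelope check of the throughput figures quoted above.
trained_tokens = 500_000  # ~500k trained tokens (3 epochs over ~4000 examples)

for minutes in (10, 60):
    print(f"{minutes} min -> ~{trained_tokens / minutes / 1000:.0f}k tokens/min")
# 10 min -> ~50k tokens/min
# 60 min -> ~8k tokens/min
```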

It seems like the bottleneck is somewhere in the “validation server”, not really in the training stage, based on the logs.

9/28/2023, 1:34 AM PDT (screenshot)

I guess just run your 10 questions while the world sleeps.

How many trained tokens in total, @_j?

Any patterns you have noticed?

Or is this just a one-off training day?

The fluctuation in training time is a real thing I experienced, so I’m not sure if this is a server bottleneck, a priority issue, or a file-size / token-count issue.

The more data, the more we can identify the problem (and see if OpenAI can fix it, or is at least aware of it!)

Are you saying run at “midnight” for the world, so 1 am Pacific?

That’s all of 1,000 tokens. I’m mainly demonstrating that there was no training-queue wait nor a wait for file processing; taking a forum user’s non-working training file all the way to a model was under 10 minutes, including writing scripts…

Ah I see.

I’m guessing the problem is regional then.

Maybe the fine-tuning servers in Europe are having an issue?