OpenAI batch API gets stuck for hours with status `in_progress`

Hello all,

For about the past week we have been seeing serious issues with Batch API calls: the API gets stuck in the `in_progress` state for hours. This started roughly 10 days ago and has happened four times in the last few days.
Manually cancelling the stuck batch jobs does not help either. In the end we are left with no option but to wait for the API to become responsive again.
Once the API becomes responsive (usually after 16-24 hrs), it processes the same request within a reasonable time. This has happened even under minimal load (i.e. with requests carrying really small payloads).
Looking at previous posts, this seems to be a known issue, but has anyone found a concrete solution? The API getting stuck like this makes it unreliable for us.
We are using a fine-tuned gpt-3.5-turbo model in this case. Looking forward to suggestions and solutions here.
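For reference, here is roughly how we check for and cancel a stuck batch (a sketch assuming the official `openai` Python SDK; the 6-hour threshold is our own choice, not anything documented, and the client is passed in so the logic can be exercised without hitting the live API):

```python
import time

def cancel_if_stuck(client, batch_id, max_age_seconds=6 * 3600):
    """Cancel a batch that has sat in `in_progress` longer than max_age_seconds.

    `client` is expected to look like an openai.OpenAI() instance.
    Returns True if a cancel was issued, False otherwise.
    """
    batch = client.batches.retrieve(batch_id)
    age = time.time() - batch.created_at  # created_at is a Unix timestamp
    if batch.status == "in_progress" and age > max_age_seconds:
        client.batches.cancel(batch_id)
        return True
    return False
```

As noted above, though, cancelling hasn't actually unblocked anything for us.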

Can you use 4o-mini instead? It’s even cheaper than 3.5

Also, the stated turnaround time is 24 hrs. Are you experiencing delays longer than that?

Thanks @nicholishen. Is it guaranteed/expected that this issue will not occur with 4o-mini?
Max delay I have seen is close to 24 hrs, so far not longer than that.

Also, it would be really helpful if you could share a link to the stated turnaround time.

You should budget 24 hrs for any batch job. My mini batches have come back pretty fast, but my batch jobs haven’t been that big either.

Batch mode can take anywhere from instant to 24 hours. The cost saving is afforded by using spare time on the compute clusters; when they are busy, it can take longer than at times when there is more spare capacity.

Basically you should only use it for tasks where a 24 hour delay will not cause an issue.

Got it @nicholishen. Are you using gpt-4o-mini for your batch jobs?

mini is my go-to. I’m mostly using structured outputs now, and when mini can’t cut it I bump up to 4o, but mini gets the job done for most things.

Got it, thanks for your suggestion here.

I also encountered this issue when using GPT-4o-mini.

We ran into the same issue this morning. The Batch API has been stuck in `in_progress` status for an extended period. :frowning:

We’re using gpt-4o. Tried gpt-4o-mini as well, but no luck. There are only a few records in the batch. I’ll let the batch run and check the status tomorrow.

Hi @rajat2 @b00802884 @ajithr007!

Batch API has been my go-to for many use cases, and I haven’t had too many issues with any of the supported models. I have had instances where the finalizing stage (when the output file is being written) takes a long time, but it always completes within 24h.

With that being said, and as others have pointed out, completion within 24h is not actually guaranteed - the Batch API uses spare compute capacity, and you are only charged for completed jobs (at a 50% discount), which also doesn’t count towards your “standard model” token limits or rate limits - so it’s very cheap.

The best practice is just to ensure that you don’t put any time-sensitive jobs in there, and to add some extra handling that checks for the `expired` state and then retries the batch.
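For illustration, that retry handling could look something like this sketch with the official `openai` Python SDK (error handling and logging omitted; the client is passed in so it can be tested with a stub):

```python
def retry_if_expired(client, batch_id):
    """Resubmit a batch that ended up `expired`, reusing its input file.

    `client` is expected to look like an openai.OpenAI() instance.
    Returns the original batch object if no retry was needed,
    otherwise the newly created batch.
    """
    batch = client.batches.retrieve(batch_id)
    if batch.status != "expired":
        return batch  # nothing to do
    return client.batches.create(
        input_file_id=batch.input_file_id,
        endpoint=batch.endpoint,    # e.g. "/v1/chat/completions"
        completion_window="24h",
    )
```

Expired batches only bill you for the requests that did complete, so resubmitting the same input file is a reasonable default.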

Even though the maximum number of requests per `.jsonl` file is 50,000, I still prefer to split it into sub-batches and run those, but it sounds like you are all running a small number of requests anyway.
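Splitting is just chunking the request list before writing the `.jsonl` files - something like this sketch (the 50,000 default reflects the documented per-file cap; the chunk size you actually pick is up to you):

```python
import json

def split_into_subbatches(requests, max_per_batch=50_000):
    """Yield .jsonl payload strings, each with at most max_per_batch requests.

    Each item in `requests` is a dict in the Batch API request format, e.g.
    {"custom_id": "...", "method": "POST", "url": "...", "body": {...}}.
    """
    for i in range(0, len(requests), max_per_batch):
        chunk = requests[i:i + max_per_batch]
        yield "\n".join(json.dumps(r) for r in chunk)
```

Each yielded string can then be uploaded as its own input file and submitted as a separate batch.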
