For the last 10 days we have been seeing serious issues with Batch API calls: batches get stuck in the in_progress state for hours. This has happened 4 times in that period.
Manually cancelling the stuck batch jobs does not help in getting the API to work either. In the end we are left with no option but to wait for the API to become responsive again.
Once the API becomes responsive (usually after 16-24 hrs), it processes the same request within a reasonable time. This has happened even under minimal load (i.e., with requests carrying really small payloads).
Looking at previous posts, this seems to be a known issue, but has anyone found a concrete solution for it? The API getting stuck like this makes it really unreliable for us.
We are using a fine-tuned gpt-3.5-turbo model in this case. Looking forward to suggestions and solutions here.
Thanks @nicholishen. Is it guaranteed/expected that this issue will not occur with 4o-mini?
The max delay I have seen is close to 24 hrs; so far nothing longer than that.
Also, it would be really helpful if you could share a link to the stated turnaround time for this issue.
Batch mode can take anywhere from instant to 24 hours. The cost saving is afforded by using spare time on the compute clusters; when they are busy, it can take longer than at times when there is more spare capacity.
Basically, you should only use it for tasks where a 24-hour delay will not cause an issue.
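Given that turnaround can legitimately stretch to 24 hours, one defensive pattern is to poll the batch with your own deadline and, if it hasn't finished in time, cancel it and fall back to the synchronous API. A minimal sketch, assuming the caller supplies a `fetch_status` callable (e.g. a wrapper around `client.batches.retrieve(batch_id).status` from the OpenAI Python SDK); the helper itself is hypothetical, not part of the SDK:

```python
import time

def wait_for_batch(fetch_status, deadline_s, poll_s=60,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll a batch's status until it reaches a terminal state or the
    deadline passes.

    fetch_status: zero-arg callable returning the current status string.
    Returns the final status, or "timed_out" if the deadline expires first,
    at which point the caller can cancel the batch and retry synchronously.
    """
    # Terminal statuses per the Batch API docs (assumed here).
    terminal = {"completed", "failed", "expired", "cancelled"}
    start = clock()
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() - start >= deadline_s:
            return "timed_out"
        sleep(poll_s)
```

The injectable `clock`/`sleep` parameters keep the helper testable without real waiting; in production you would call it with the defaults.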