Hello all,
For the past week or so we have been seeing serious issues with batch API calls. Batches get stuck in the in_progress state for hours. This started about 10 days ago and has happened 4 times in recent days.
Manually cancelling the stuck batch jobs does not get the API working again either. In the end, we are left with no option other than to wait for the API to become responsive.
Once the API becomes responsive (usually after 16-24 hours), it processes the same request within a reasonable time. This has happened even under minimal load (that is, with requests that have a really small payload).
Looking at previous posts, this seems to be a known issue, but has anyone found a concrete solution? Batches getting stuck like this make the API unreliable for us.
We are using a fine-tuned gpt35-turbo model in this case. Looking forward to any suggestions and solutions.
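
In case it helps frame the discussion, here is a minimal sketch of a client-side watchdog that would at least detect a batch stuck in in_progress and stop waiting on it. It does not fix the service-side stall. The function name `wait_for_batch`, its parameters, and the "timed_out" return value are hypothetical; the set of terminal status names is an assumption and should be checked against the batch API documentation for your deployment.

```python
import time
from typing import Callable

# Assumed terminal batch states; verify against your API's documentation.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    get_status: Callable[[], str],
    cancel: Callable[[], None],
    max_wait_s: float = 3600.0,
    poll_interval_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a batch until it reaches a terminal state or max_wait_s elapses.

    get_status and cancel are caller-supplied callables (hypothetical
    wrappers around whatever client you use). On timeout, a cancel is
    requested and "timed_out" is returned so the caller can alert or
    resubmit later.
    """
    deadline = clock() + max_wait_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            # Best-effort cancel; the batch may still remain stuck server-side.
            cancel()
            return "timed_out"
        sleep(poll_interval_s)
```

With the openai Python client, `get_status` could be something like `lambda: client.batches.retrieve(batch_id).status` and `cancel` something like `lambda: client.batches.cancel(batch_id)`, assuming `client` and `batch_id` are already set up. The clock and sleep parameters are injectable only so the loop can be tested without real waiting.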