Whisper API server error for long (not big) files

I’m trying to transcribe longer videos of interviews/spoken text. So I create a 24 kb/s MP3 file from the video and pass it to the Whisper API. This works fine for a 15-minute video, but when I pass a 100-minute video (about 19 MB) I get a server error response from the API every time.

“The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID c68ec549dfdf0644d4c0124df32fb9a0 in your email.)”

Is there an internal limit on the maximum duration Whisper is able to handle (and if so, why isn’t it documented)?

There must be some issues with their API. I called the fine-tune API a bunch of times, and 99.99% of calls failed with a server error.

You could be hitting some undocumented maximum HTTPS response body size, since 100 minutes of audio can produce a large response body, or some model timeout limit.

The obvious solution is to cut down and chunk your input data so you don’t hit this unknown limit.
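Chunking can be as simple as computing fixed-length spans with a small overlap before cutting the audio. A minimal sketch in Python (the 15-minute chunk length and 5-second overlap are illustrative values I picked, not documented limits):

```python
def chunk_spans(total_ms: int, chunk_ms: int = 15 * 60 * 1000,
                overlap_ms: int = 5 * 1000):
    """Return (start_ms, end_ms) spans covering the whole file,
    each at most chunk_ms long, with a small overlap at the seams."""
    spans = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        spans.append((start, end))
        if end == total_ms:
            break
        # Step back a little so adjacent chunks share some audio.
        start = end - overlap_ms
    return spans
```

Each span can then be cut out with ffmpeg or pydub and sent to the API separately; the overlap gives you shared context for stitching the transcripts back together.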

Yes, I figured splitting is an option, but in my opinion it’s a really bad one. I think the API should take care of it, not the requester. Whisper knows what is said; it doesn’t make sense that it expects me to figure out where to cut files to avoid losing context.

Also, this does seem to come with a host of new problems. Whisper seems to decide randomly, based on the text, how long each transcript entry is. For some transcripts it puts up to 60 words into one VTT line (which makes it useless), for others it’s a new entry every 5 words. This is true for different chunks of the same video/interview. There are no parameters to influence this.

The next issue would be merging the transcript chunks properly, which isn’t entirely trivial either, given that the words and entries of even overlapping content can differ in every chunk.

Quite honestly, it all seems to be a bit of a mess right now, at least for long inputs.

I am splitting 90-minute audio files when sending them for transcription as well, due to the 25 MB size limit. I might lose too much quality if I compressed them further.

I still need to analyze how many words I lose, but I usually split these into 3 parts, so the worst that could happen would be 3 lost words. I don’t think that is such a problem, because there are probably many more than 3 errors in such a transcription anyway. Did you notice errors when splitting files, or are you suspecting these might happen?

I was thinking about splitting with some kind of sliding window approach, but this would probably mean I would need to send all three transcriptions to the GPT-3.5 model to stitch them together intelligently. Have you thought about that?
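For the stitching step, if the chunks overlap, one option short of calling GPT-3.5 is to look for a repeated run of words at the seam and drop the duplicate. A toy sketch (it assumes Whisper transcribed the overlapping words identically in both chunks, which it won’t always do; that’s where a model-based stitch would help):

```python
def merge_overlapping(a: str, b: str, max_overlap: int = 50) -> str:
    """Join two chunk transcripts, removing the longest run of words
    that ends chunk `a` and also starts chunk `b`."""
    aw, bw = a.split(), b.split()
    # Try the longest plausible overlap first, shrinking to 1 word.
    for n in range(min(max_overlap, len(aw), len(bw)), 0, -1):
        if aw[-n:] == bw[:n]:
            return " ".join(aw + bw[n:])
    # No exact overlap found: fall back to plain concatenation.
    return " ".join(aw + bw)
```

For example, merging "…the quick brown fox" with "brown fox jumps over…" keeps only one copy of "brown fox" in the result.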


I used to run Whisper on Huggingface, with 30-second chunks! Chunks were created by slicing exactly at the 30-second mark, without even caring if that cut mid-word. Putting it together was easy: just concatenate the text in the order of the chunks.

The more advanced version would be to use pydub and set it to not break across a word. But stitching is still easy.
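The "don’t break across a word" idea boils down to cutting where the audio is quiet; pydub’s `split_on_silence` does exactly that on real audio. The underlying logic, sketched here on a plain list of per-frame loudness values (the threshold and run-length values are illustrative, and this toy version only cuts between loud regions):

```python
def silence_cut_points(frames, thresh=0.01, min_quiet=3):
    """Return frame indices where the audio can be cut: the midpoint of
    every quiet run (level < thresh) at least min_quiet frames long."""
    cuts, run_start = [], None
    for i, level in enumerate(frames):
        if level < thresh:
            if run_start is None:
                run_start = i  # a quiet run begins here
        else:
            if run_start is not None and i - run_start >= min_quiet:
                cuts.append((run_start + i) // 2)
            run_start = None
    return cuts
```

In practice you would let pydub compute the loudness and do the slicing, but the principle is the same: cut in the middle of silence, never in the middle of speech.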

With larger files the word error rate drops substantially, because the errors happen at the crossovers, and larger files mean fewer crossovers!

I don’t follow the VTT issue; for me it’s file-in/text-out, and it’s better than AWS Transcribe.


I initially ran the open-source Whisper locally but have now switched to the API. In the local version, however, it was possible to receive the transcription with timecodes. I believe this is not possible in the API, right?

If not, would it be possible to instruct it to return the response in a nicer format, not just a big chunk of text? Or would I need to do that afterwards myself, adding line breaks and paragraphs more or less arbitrarily to the text block?
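If you do end up re-flowing the plain-text output yourself, splitting on sentence-ending punctuation and wrapping each sentence to a readable width is a simple starting point. A sketch using only the standard library (the regex is a rough heuristic I’m assuming here, not something the API provides):

```python
import re
import textwrap

def rewrap(transcript: str, width: int = 80) -> str:
    """Put each sentence of a flat transcript on its own line,
    wrapped to at most `width` characters."""
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', transcript.strip())
    return "\n".join(textwrap.fill(s, width=width) for s in sentences)
```

It won’t produce real paragraphs, but it turns the block of text into something skimmable without touching the words themselves.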