The documentation mentions audio splicing and chunking. I've done that; let's say I have a 60-minute audio file that I want to transcribe, so I divide it into 6 chunks of 10 minutes each, pass them through Whisper, and get 6 WebVTT files.
What's the best approach to stitch the WebVTT files together into a single, accurate .srt file? Is that even what I should be doing? Since I've never done this before, are there easier ways?
Basically: how does one go about transcribing a 60-minute audio file?
I'm using Python and Node.js on the backend, so kindly let me know how you'd go about it.
Typically you would split the audio into chunks of 20 MB or less, using an intelligent algorithm that looks for silence so you don't chop a spoken word in half. Once that's done, you send one chunk to Whisper to be transcribed. When the text comes back, you send the next chunk off to be processed, and when that returns you append its text to the end of the first. Repeat until all chunks have been processed and appended.
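To illustrate the silence-aware splitting idea, here is a minimal pure-Python sketch that operates on a list of amplitude samples. It is only a toy: in practice you would work with a real audio file via a library such as pydub or ffmpeg's `silencedetect` filter, and the `threshold`, `min_silence`, and `max_chunk` parameters here are made-up names for this sketch, not anything from a real API.

```python
# Toy sketch of silence-aware chunking on raw amplitude samples.
# Real code would use pydub/ffmpeg on actual audio; this only shows the logic.

def split_on_silence(samples, threshold=0.05, min_silence=3, max_chunk=50):
    """Split a list of amplitude values at silent stretches.

    threshold   -- amplitudes below this count as silence
    min_silence -- minimum consecutive silent samples before we split
    max_chunk   -- hard cap on chunk length, mirroring the API's size limit
    """
    chunks, current, silent_run = [], [], 0
    for s in samples:
        current.append(s)
        silent_run = silent_run + 1 if abs(s) < threshold else 0
        # Split once we've seen enough silence, or if we hit the size cap,
        # so no spoken word is cut in half (silence marks a safe boundary).
        if silent_run >= min_silence or len(current) >= max_chunk:
            chunks.append(current)
            current, silent_run = [], 0
    if current:
        chunks.append(current)
    return chunks

# Loud speech, a pause, then more speech: the split lands in the pause.
samples = [0.5] * 10 + [0.0] * 5 + [0.6] * 10
print([len(c) for c in split_on_silence(samples)])
```

The same loop structure carries over to real audio: detect a quiet stretch, cut there, and fall back to a hard cut only when a chunk would otherwise exceed the size limit.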
I get that, but suppose the first chunk's timestamps run from 0:00:00 to 0:05:00, and the next chunk is 3 minutes long, so its timestamps also run from 0:00:00 to 0:03:00. Now I have two WebVTT files whose timestamps both start at 0:00:00, so it's not exactly convenient to append them. How does one handle this? Or am I missing something?
Well, WebVTT is a text-based format, so you can use standard string and time manipulation functions in your language of choice to adjust the timestamps. As long as you know the starting timestamp of each split file, you keep internal track of where each chunk begins and shift the returned WebVTT values accordingly. For example, if you get back a 0:00:00–0:03:00 range and then another 0:00:00–0:03:00 range, you know to add 0:03:00 to every value in the second one: a cue at 00:01:00 in that second 3-minute segment becomes 00:04:00. There may well be WebVTT libraries for Python with these functions pre-made.
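The timestamp-shifting step above can be sketched with the standard library alone. This is an assumption-laden sketch, not a vetted parser: it assumes the `HH:MM:SS.mmm` timestamp form that Whisper emits (real WebVTT may also omit the hours field), and `shift_vtt` is a hypothetical helper name.

```python
import re

# Shift every HH:MM:SS.mmm timestamp in a WebVTT chunk by a fixed offset,
# so per-chunk files (which all start at 00:00:00) can be concatenated.
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def shift_vtt(vtt_text, offset_seconds):
    """Return vtt_text with all cue timestamps moved forward by the offset."""
    def bump(match):
        h, mi, s, ms = (int(g) for g in match.groups())
        total_ms = (h * 3600 + mi * 60 + s) * 1000 + ms \
                   + int(offset_seconds * 1000)
        h, rem = divmod(total_ms, 3_600_000)
        mi, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{mi:02}:{s:02}.{ms:03}"
    return TS.sub(bump, vtt_text)

# Second chunk started 3 minutes into the original audio:
chunk2 = "WEBVTT\n\n00:01:00.000 --> 00:01:05.000\nHello there\n"
print(shift_vtt(chunk2, 180))  # the cue now reads 00:04:00.000 --> 00:04:05.000
```

After shifting each chunk by its known start offset, the cue blocks can be concatenated in order (keeping a single `WEBVTT` header) or renumbered and reformatted into .srt.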
Search for the Whisper JAX project; with it you can transcribe and translate audio, video, and YouTube content. It is provided through Hugging Face, uses the OpenAI Whisper engine, and is very fast.
Can Whisper JAX be used as a service (that is, I pay for it), or is it something I have to set up and host on my own server?
Also, does Whisper JAX handle longer recordings (more than an hour) via chunking, as described earlier in this thread, or is that something I still have to do myself?
You can run Whisper JAX locally (it needs to be installed), or use it through a Hugging Face Inference Endpoint.
Here is the link to the Git repository, with manuals on how to use it.
It has been reported that Whisper JAX can transcribe/translate more than an hour of video, audio, or YouTube content.