The documentation mentions audio splicing and chunking. I've done that; let's say I have a 60-minute audio file that I want to transcribe, so I divide it into 6 chunks of 10 minutes each, pass them through Whisper, and get 6 WebVTT files.
What's the best approach to stitch the WebVTT files together into a single, accurate .srt file? Is that even what I should be doing? Since I've never done this before, are there easier ways?
Basically: how does one go about transcribing a 60-minute audio file?
I'm using Python and Node.js on the backend, so kindly let me know how you'd go about it.
Typically you would split the audio into chunks of 20 MB or less, using an intelligent algorithm that looks for silence so you don't chop a spoken word in half. Once that's done, you send one chunk to Whisper to be transcribed. When the text comes back, you send the next chunk off to be processed, and when that returns you append its text to the end of the first. Repeat until all chunks have been processed and appended.
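To illustrate the silence-aware splitting idea, here is a minimal pure-Python sketch that operates on a list of amplitude samples. It is only a toy: in practice you would work with a real audio file via a library such as pydub or ffmpeg's `silencedetect` filter, and the `threshold`, `min_silence`, and `max_chunk` parameters here are made-up names for this sketch, not anything from a real API.

```python
# Toy sketch of silence-aware chunking on raw amplitude samples.
# Real code would use pydub/ffmpeg on actual audio; this only shows the logic.

def split_on_silence(samples, threshold=0.05, min_silence=3, max_chunk=50):
    """Split a list of amplitude values at silent stretches.

    threshold   -- amplitudes below this count as silence
    min_silence -- minimum consecutive silent samples before we split
    max_chunk   -- hard cap on chunk length, mirroring the API's size limit
    """
    chunks, current, silent_run = [], [], 0
    for s in samples:
        current.append(s)
        silent_run = silent_run + 1 if abs(s) < threshold else 0
        # Split once we've seen enough silence, or if we hit the size cap,
        # so no spoken word is cut in half (silence marks a safe boundary).
        if silent_run >= min_silence or len(current) >= max_chunk:
            chunks.append(current)
            current, silent_run = [], 0
    if current:
        chunks.append(current)
    return chunks

# Loud speech, a pause, then more speech: the split lands in the pause.
samples = [0.5] * 10 + [0.0] * 5 + [0.6] * 10
print([len(c) for c in split_on_silence(samples)])
```

The same loop structure carries over to real audio: detect a quiet stretch, cut there, and fall back to a hard cut only when a chunk would otherwise exceed the size limit.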
I get that, but suppose the first chunk's timestamps run from 0:00:00 to 0:05:00, and the next chunk is 3 minutes long, so its timestamps also run from 0:00:00 to 0:03:00. Now I have two WebVTT files whose timestamps both start at 0:00:00, so it's not exactly convenient to append them. How does one handle this? Or am I missing something?
Well, WebVTT is a text-based format, so you can use standard string and time manipulation functions in your language of choice to adjust the timestamps. As long as you know the starting timestamp of each split file, you keep internal track of where each chunk begins and shift the returned WebVTT values accordingly. For example, if you get back a 0:00:00–0:03:00 range and then another 0:00:00–0:03:00 range, you know to add 0:03:00 to every value in the second one: a cue at 00:01:00 in that second 3-minute segment becomes 00:04:00. There may well be WebVTT libraries for Python with these functions pre-made.
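The timestamp-shifting step above can be sketched with the standard library alone. This is an assumption-laden sketch, not a vetted parser: it assumes the `HH:MM:SS.mmm` timestamp form that Whisper emits (real WebVTT may also omit the hours field), and `shift_vtt` is a hypothetical helper name.

```python
import re

# Shift every HH:MM:SS.mmm timestamp in a WebVTT chunk by a fixed offset,
# so per-chunk files (which all start at 00:00:00) can be concatenated.
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def shift_vtt(vtt_text, offset_seconds):
    """Return vtt_text with all cue timestamps moved forward by the offset."""
    def bump(match):
        h, mi, s, ms = (int(g) for g in match.groups())
        total_ms = (h * 3600 + mi * 60 + s) * 1000 + ms \
                   + int(offset_seconds * 1000)
        h, rem = divmod(total_ms, 3_600_000)
        mi, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{mi:02}:{s:02}.{ms:03}"
    return TS.sub(bump, vtt_text)

# Second chunk started 3 minutes into the original audio:
chunk2 = "WEBVTT\n\n00:01:00.000 --> 00:01:05.000\nHello there\n"
print(shift_vtt(chunk2, 180))  # the cue now reads 00:04:00.000 --> 00:04:05.000
```

After shifting each chunk by its known start offset, the cue blocks can be concatenated in order (keeping a single `WEBVTT` header) or renumbered and reformatted into .srt.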
Search for the Whisper JAX project; with it you can transcribe and translate audio, video, and YouTube content. It is provided through Hugging Face, uses the OpenAI Whisper engine, and is very fast.
Can Whisper JAX be used as a service (that is, I pay for it), or is it something I have to set up and host on my own server?
Also, does Whisper JAX handle longer recordings (more than an hour) via chunking, as described earlier in this thread, or is that something I still have to do myself?
You can run Whisper JAX locally (it needs to be installed), or use it through a Hugging Face Inference Endpoint.
Here is the link to the Git repository, with manuals on how to use it.
It has been reported that Whisper JAX can transcribe/translate more than an hour of video, audio, or YouTube content.