How to transcribe long audio to srt file directly?

jason123 · October 4, 2023, 10:56am

Hello everyone,
I’d like to use the Whisper API to transcribe roughly 2 hours of a conference speech from an mp4 video into an srt subtitle file. I have a few questions:

1. Do I need to convert the mp4 file into wav or mp3 format first?
2. The Whisper API seems to have a file size limit of 25 MB per request. If I split the video into chunks, the resulting srt file might have incorrect timecodes. How should I handle this?

Thank you.

Foxalabs · October 4, 2023, 11:56am

Hi and welcome to the Developer Forum!

You will need to encode your audio into a supported file format. mp4 is supported, but it is hugely inefficient to upload the video along with the audio, so I would for sure strip out just the audio track.
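
For what it's worth, here is a rough sketch of stripping the audio track. It assumes the ffmpeg command-line tool is installed and on the PATH. The helper names, the file names and the 32 kbps mono setting are illustrative choices only, not anything the API prescribes.

```python
import subprocess


def build_ffmpeg_audio_command(src, dst, bitrate="32k"):
    """Return an ffmpeg command that drops the video stream and writes mono audio."""
    return [
        "ffmpeg",
        "-y",              # overwrite dst if it already exists
        "-i", src,         # input file, for example the mp4 recording
        "-vn",             # do not copy any video into the output
        "-ac", "1",        # downmix to a single audio channel
        "-b:a", bitrate,   # target audio bitrate
        dst,
    ]


def extract_audio(src, dst, bitrate="32k"):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_audio_command(src, dst, bitrate), check=True)
```

Calling `extract_audio("conference.mp4", "conference.mp3")` with your own file names should leave you with an audio-only file that is far smaller than the video. A lower bitrate gives a smaller file at some cost in quality, so it is worth listening to the result before transcribing.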

There is an open-source Python library called pydub (a third-party package, not an OpenAI product) that you can install and use to chunk your audio into sections of under 25 MB. Its silence-detection helpers let you cut at pauses so that you do not break a word in half at a chunk boundary.
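
A minimal sketch of cutting at pauses, assuming pydub (and the ffmpeg it relies on) is installed. `choose_cut_points` and `split_on_pauses` are hypothetical helper names, and the 10-minute maximum, 500 ms minimum pause and "16 dB below average loudness" threshold are guesses you would need to tune for your recording.

```python
def choose_cut_points(total_ms, silences, max_chunk_ms):
    """Pick cut positions (ms) so that no chunk is longer than max_chunk_ms.

    silences is a list of (start_ms, end_ms) pauses. Each cut goes in the
    middle of the latest pause that still fits; if no pause fits, the cut
    falls back to a hard cut at the maximum length.
    """
    midpoints = sorted((start + end) // 2 for start, end in silences)
    cuts = []
    chunk_start = 0
    while total_ms - chunk_start > max_chunk_ms:
        limit = chunk_start + max_chunk_ms
        candidates = [m for m in midpoints if chunk_start < m <= limit]
        cut = candidates[-1] if candidates else limit
        cuts.append(cut)
        chunk_start = cut
    return cuts


def split_on_pauses(path, max_chunk_ms=10 * 60 * 1000):
    """Return a list of (offset_ms, AudioSegment) pairs in playback order."""
    # Imported here so the helper above can be used without pydub installed.
    from pydub import AudioSegment
    from pydub.silence import detect_silence

    audio = AudioSegment.from_file(path)
    silences = detect_silence(
        audio,
        min_silence_len=500,             # ignore pauses shorter than 0.5 s
        silence_thresh=audio.dBFS - 16,  # relative to the average loudness
        seek_step=10,                    # check every 10 ms to save time
    )
    bounds = [0] + choose_cut_points(len(audio), silences, max_chunk_ms) + [len(audio)]
    return [(start, audio[start:end]) for start, end in zip(bounds, bounds[1:])]
```

Keeping the start offset next to each chunk makes the later timestamp correction straightforward. The maximum chunk duration has to be chosen so that the exported file stays under the size limit at whatever bitrate you export with.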

As for timecodes, you will know the length of each audio chunk. With that information you can keep a running offset in your code and add it to the timestamps of each chunk as a post-processing step.
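
To make the offset idea concrete, here is a small sketch. It assumes the chunk lengths are known in milliseconds and that timestamps use the usual SRT form `HH:MM:SS,mmm`. All function names are made up for illustration.

```python
def chunk_offsets(chunk_lengths_ms):
    """Start time of each chunk within the full recording, in milliseconds."""
    offsets, elapsed = [], 0
    for length in chunk_lengths_ms:
        offsets.append(elapsed)
        elapsed += length
    return offsets


def srt_to_ms(stamp):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to milliseconds."""
    clock, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def ms_to_srt(ms):
    """Convert milliseconds back to an SRT timestamp."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def shift_stamp(stamp, offset_ms):
    """Move a timestamp later by offset_ms."""
    return ms_to_srt(srt_to_ms(stamp) + offset_ms)


# Three chunks of 10 min, 9 min 58.5 s and 2 min.
offsets = chunk_offsets([600_000, 598_500, 120_000])
print(offsets)                                   # [0, 600000, 1198500]
print(shift_stamp("00:00:03,250", offsets[2]))   # 00:20:01,750
```

A subtitle that starts 3.25 seconds into the third chunk therefore lands at 20 minutes 1.75 seconds in the merged file.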

jason123 · October 5, 2023, 12:21am

Thank you for your advice. Since I’m a complete beginner in Python, there are many programming commands I’m not clear about, so I have to ask ChatGPT. Previously, I was splitting based on the video’s duration:

chunk_length = 30 * 1000 # in milliseconds
chunks = [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]

So, based on your suggestion, I got the following code from ChatGPT, but I’m not sure whether this approach is correct:

from pydub import AudioSegment

audio = AudioSegment.from_file("your_audio_file.mp3")
chunk_size = 25 * 1024 * 1024  # 25MB
chunks = []
current_chunk = AudioSegment.empty()

for segment in audio:
    if len(current_chunk) + len(segment) < chunk_size:
        current_chunk += segment
    else:
        chunks.append(current_chunk)
        current_chunk = segment

if len(current_chunk) > 0:
    chunks.append(current_chunk)
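
One thing to watch in the snippet above: as far as I can tell, `len()` of a pydub `AudioSegment` is a duration in milliseconds, and looping over a segment yields 1 ms slices, so comparing it against a byte count does not limit the file size. A possible alternative, sketched here under the assumption that chunks are exported at a constant bitrate, is to work out how many milliseconds fit under the limit and slice by time. The function names, the 64 kbps bitrate and the 90% safety margin are illustrative assumptions.

```python
def max_chunk_ms(limit_bytes=25 * 1024 * 1024, bitrate_kbps=64, safety_percent=90):
    """Longest duration (ms) whose encoded size should stay under limit_bytes.

    At a bitrate of B kbit/s, one millisecond of audio takes B bits, so the
    duration in ms is simply usable_bits // B. The safety margin leaves room
    for container overhead and bitrate fluctuation.
    """
    usable_bytes = limit_bytes * safety_percent // 100
    return usable_bytes * 8 // bitrate_kbps


def chunk_ranges(total_ms, chunk_ms):
    """(start_ms, end_ms) pairs covering the whole recording."""
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# Example: a 2-hour recording exported at 64 kbps.
chunk_ms = max_chunk_ms(bitrate_kbps=64)
for start, end in chunk_ranges(2 * 60 * 60 * 1000, chunk_ms):
    print(start, end)
    # With pydub, each piece could then be written out along the lines of:
    # audio[start:end].export(f"chunk_{start}.mp3", format="mp3", bitrate="64k")
```

With these numbers each chunk is roughly 49 minutes long, so a 2-hour recording splits into three files. Checking the size of the exported files before uploading is still a good idea.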

Additionally, how should I write the program to merge SRT time codes from different parts?

Thank you.
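
On merging the SRT parts, here is one possible sketch. It assumes each chunk has already been transcribed to SRT text (the transcription endpoint offers an `srt` response format) and that you kept the start offset of each chunk in milliseconds. `merge_srt` is a hypothetical helper, not part of any library.

```python
import re

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _shift(match, offset_ms):
    """Rewrite one 'HH:MM:SS,mmm' match so that it is offset_ms later."""
    h, m, s, ms = (int(group) for group in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rest = divmod(total, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def merge_srt(parts):
    """Merge SRT texts into one file.

    parts is a list of (offset_ms, srt_text) pairs in playback order.
    Timestamps are shifted by the chunk offset and cues are renumbered.
    """
    cues = []
    for offset_ms, text in parts:
        # Cues in an SRT file are separated by blank lines.
        for block in re.split(r"\n\s*\n", text.strip()):
            lines = block.splitlines()
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # skip anything that does not look like a cue
            timing = STAMP.sub(lambda mt: _shift(mt, offset_ms), lines[1])
            cues.append((timing, lines[2:]))
    blocks = [
        "\n".join([str(index), timing, *body])
        for index, (timing, body) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"
```

For example, `merge_srt([(0, first_srt), (600_000, second_srt)])` would leave the first chunk's times unchanged, move every cue of the second chunk 10 minutes later, and number the cues continuously from 1.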
