How to transcribe long audio directly to an srt file?

Hello everyone,
I’d like to use the Whisper API to transcribe a conference speech of approximately 2 hours from an mp4 video into an srt subtitle file. I have a few questions:
Do I need to convert the mp4 file into wav or mp3 format first?

It seems that the Whisper API has a file size limit of 25 MB per request.
If I split the video into chunks, the resulting srt file might have incorrect timecodes.

How should I handle this?

Thank you.

Hi and welcome to the Developer Forum!

You will need to encode your audio into a supported file format. mp4 is supported, but uploading the video track along with the audio is hugely inefficient, so I would definitely strip out just the audio.
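One way to strip out the audio is the ffmpeg command-line tool, which is a separate install and not part of Python. A minimal sketch, assuming ffmpeg is on the PATH; the helper names `extract_audio_cmd` and `estimated_size_mb`, and the mono / 16 kHz / 64k settings, are illustrative choices rather than anything the API requires:

```python
def extract_audio_cmd(src, dst, bitrate="64k"):
    """Build an ffmpeg command that drops the video and keeps mono audio."""
    return [
        "ffmpeg",
        "-i", src,        # input file (the mp4)
        "-vn",            # no video stream in the output
        "-ac", "1",       # downmix to one channel
        "-ar", "16000",   # resample to 16 kHz
        "-b:a", bitrate,  # audio bitrate of the encoded output
        dst,
    ]


def estimated_size_mb(duration_s, bitrate_kbps):
    """Rough size of constant-bitrate audio, in megabytes (10**6 bytes)."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1e6
```

The command list can be passed to `subprocess.run(cmd, check=True)`. By the estimate above, 2 hours at 64 kbit/s is roughly 57.6 MB, so the audio would still need to be split or encoded at a lower bitrate.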

There is an open-source Python library called pydub that you can install and use to chunk your audio into sections of under 25 MB. Its silence detection lets you cut at pauses, so you do not break a word in half at a boundary.
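For the gap detection, `pydub.silence.detect_silence` returns a list of `[start_ms, end_ms]` silent ranges. A minimal sketch of choosing cut points from such ranges follows. The function `pick_cut_points` is hypothetical, the threshold values in the docstring are guesses that would need tuning, and it assumes a maximum chunk duration has already been chosen so that each exported chunk stays under the size limit:

```python
def pick_cut_points(silences, total_ms, max_chunk_ms):
    """Return cut positions (ms) so that no chunk exceeds max_chunk_ms.

    silences: list of [start_ms, end_ms] ranges, e.g. from
    pydub.silence.detect_silence(audio, min_silence_len=500, silence_thresh=-40).
    A cut lands in the middle of a silence when one is available inside the
    current window; otherwise a hard cut is made at the window limit.
    """
    midpoints = [(start + end) // 2 for start, end in silences]
    cuts = []
    chunk_start = 0
    while total_ms - chunk_start > max_chunk_ms:
        limit = chunk_start + max_chunk_ms
        # Silence midpoints that fall inside the current window.
        candidates = [m for m in midpoints if chunk_start < m <= limit]
        cut = max(candidates) if candidates else limit
        cuts.append(cut)
        chunk_start = cut
    return cuts
```

With pydub, the chunks would then be the slices `audio[a:b]` for consecutive boundaries in `[0, *cuts, len(audio)]`, since slicing an `AudioSegment` is done in milliseconds.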

As for timecodes, you will know the length of each audio chunk. With that information you can keep track of the offset that has to be added to each chunk's timestamps, and apply it in your code as a post-processing step.
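The offset bookkeeping could look something like the sketch below. It assumes the chunk durations are available in milliseconds (with pydub, `len(chunk)` gives that); the function names are made up for illustration:

```python
from itertools import accumulate


def chunk_offsets(durations_ms):
    """Start time of each chunk within the full recording, in milliseconds."""
    return [0, *accumulate(durations_ms)][:-1]


def format_srt_time(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"
```

A subtitle that starts 4.25 s into the third chunk would then be stamped with `format_srt_time(chunk_offsets(durations)[2] + 4250)`.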


Thank you for your advice. Since I’m a complete beginner in Python, there are many programming commands I’m not clear about, so I have to ask ChatGPT. Previously, I was splitting based on the video’s duration:

chunk_length = 30 * 1000 # in milliseconds
chunks = [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]

  1. Based on your suggestion, I got the following code from ChatGPT. Is this approach correct?

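from pydub import AudioSegment

audio = AudioSegment.from_file("your_audio_file.mp3")
chunk_size = 25 * 1024 * 1024  # 25 MB size budget, in bytes
bitrate = 64_000  # assumed export bitrate, in bits per second

# len() of an AudioSegment is its duration in milliseconds, not its size in
# bytes, so convert the byte budget into a duration (with a 10% safety margin).
chunk_length = int(chunk_size * 8 / bitrate * 1000 * 0.9)

# Slice by time. Iterating over an AudioSegment yields 1 ms pieces, which is very slow.
chunks = [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]

# Export each chunk at the bitrate assumed above so the size estimate holds.
for n, chunk in enumerate(chunks):
    chunk.export(f"chunk_{n:03d}.mp3", format="mp3", bitrate="64k")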

  2. Additionally, how should I write a program to merge the SRT timecodes from the different parts?

Thank you.
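
For the merging question, a minimal sketch of one possible approach is below. It assumes each chunk was transcribed separately into SRT text (for example by requesting the `srt` response format) and that the start offset of each chunk in milliseconds is known. `merge_srt` and its helper are hypothetical names, not part of any library:

```python
import re

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _shift(match, offset_ms):
    """Add offset_ms to one HH:MM:SS,mmm timestamp."""
    h, m, s, ms = map(int, match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rest = divmod(total, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def merge_srt(parts, offsets_ms):
    """Merge SRT strings, shifting each part by its offset and renumbering cues."""
    merged = []
    for text, offset in zip(parts, offsets_ms):
        for block in re.split(r"\n\s*\n", text.strip()):
            lines = block.splitlines()
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # skip anything that is not a well-formed cue
            timing = TIME_RE.sub(lambda mt: _shift(mt, offset), lines[1])
            merged.append("\n".join([str(len(merged) + 1), timing, *lines[2:]]))
    return "\n\n".join(merged) + "\n"
```

Calling `merge_srt([srt_chunk_0, srt_chunk_1], [0, 1_800_000])` would shift every cue of the second part by 30 minutes and number the cues continuously.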