I have a question. I’d like to use the Whisper API to transcribe approximately 2 hours of a conference speech from an mp4 video into an srt subtitle file. I have a few questions:
Do I need to convert the mp4 file into wav or mp3 format first?
It seems that the Whisper API has a file size limit of 25 MB per request.
If I split the video into chunks, the resulting srt file might have incorrect timecodes.
How should I handle this?
Hi and welcome to the Developer Forum!
You will need to encode your audio into a supported file format. mp4 is supported, but it is hugely inefficient if you are also transporting the video along with your audio, so I would for sure strip out just the audio track.
There is a Python chunking library called pydub that you can install and use to chunk your audio into sub-25 MB sections, with intelligent gap detection to ensure you do not break a word in half at a boundary.
As for timecodes, you will know the length of each audio chunk; with that information you can keep track of the cumulative timecode offset to add onto the timestamps as a post-processing step in your code.
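A sketch of that post-processing step in plain Python, with no SRT library (the helper names and the timestamp regex are mine; it assumes the standard `HH:MM:SS,mmm` SRT timestamp format and blank-line-separated subtitle blocks):

```python
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_timestamp(match, offset_ms):
    """Rewrite one HH:MM:SS,mmm timestamp, shifted by offset_ms."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rem = divmod(total, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def shift_srt(srt_text, offset_ms):
    """Add offset_ms to every timestamp in one chunk's SRT output."""
    return TS.sub(lambda m: shift_timestamp(m, offset_ms), srt_text)

def merge_srt(chunks_srt, chunk_lengths_ms):
    """Shift each chunk's SRT by the total length of the earlier chunks,
    concatenate, and renumber the subtitle blocks from 1."""
    offset = 0
    blocks = []
    for text, length in zip(chunks_srt, chunk_lengths_ms):
        shifted = shift_srt(text, offset)
        for block in shifted.strip().split("\n\n"):
            lines = block.split("\n")
            blocks.append("\n".join(lines[1:]))  # drop the old index line
        offset += length
    return "\n\n".join(f"{i}\n{b}" for i, b in enumerate(blocks, start=1)) + "\n"
```

With pydub, `len(chunk)` gives each chunk's length in milliseconds, which is exactly what `chunk_lengths_ms` needs.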
Thank you for your advice. Since I’m a complete beginner in Python, there are many programming commands I’m not clear about, so I have to ask ChatGPT. Previously, I was splitting based on the video’s duration:
```python
chunk_length = 30 * 1000  # chunk length in milliseconds
chunks = [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]
```
- So, based on your suggestion, I was able to get the following code from ChatGPT. I'm not sure if this approach is correct?
```python
from pydub import AudioSegment

audio = AudioSegment.from_file("your_audio_file.mp3")

chunk_size = 25 * 1024 * 1024  # 25 MB limit, in bytes

chunks = []
current_chunk = AudioSegment.empty()
for segment in audio[::1000]:  # step through the audio in 1-second slices
    # compare byte sizes (raw_data); len() on a segment is milliseconds
    if len(current_chunk.raw_data) + len(segment.raw_data) < chunk_size:
        current_chunk += segment
    else:
        chunks.append(current_chunk)
        current_chunk = segment
if len(current_chunk) > 0:
    chunks.append(current_chunk)
```
- Additionally, how should I write the program to merge SRT time codes from different parts?