I need to transcribe audio files of up to three hours in length. Should I wait for audio support to be added to the GPT-4o API, or simply use Whisper speech-to-text and then clean up the result with GPT?
Either way, the audio file needs to be split into smaller chunks. The question is: how do I seamlessly stitch the resulting text chunks back together?
Thanks
How I do it:
- Optionally, convert everything to WAV first for faster processing.
- Identify noise levels and gaps in the recording to find possible split points.
- Align the split points to the maximum chunk size.
- Split the source audio file into chunks according to that mapping.
- Convert the audio to 48K mono OGG for faster processing.
- Transcribe each chunk (I do this in parallel to speed things up).
- Concatenate the transcripts back together in order.
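The "align split points to the maximum chunk size" step above could be sketched roughly like this. It is not the exact code from the workflow, just one plausible greedy approach: it assumes you already have a sorted list of candidate silence timestamps (in seconds) from whatever gap detection you use, and the names `plan_splits` and `to_ranges` are made up for illustration.

```python
def plan_splits(silences, duration, max_chunk):
    """Pick split times (seconds) so that no chunk is longer than max_chunk.

    silences: sorted candidate split times, e.g. midpoints of quiet gaps
              (assumed to come from a separate silence-detection step).
    duration: total length of the recording in seconds.
    max_chunk: maximum allowed chunk length in seconds.
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    splits = []
    start = 0.0
    i = 0
    while duration - start > max_chunk:
        limit = start + max_chunk
        best = None
        # Keep the latest silence that still fits inside the current chunk.
        while i < len(silences) and silences[i] <= limit:
            if silences[i] > start:
                best = silences[i]
            i += 1
        # No usable gap found: fall back to a hard cut at the size limit.
        cut = best if best is not None else limit
        splits.append(cut)
        start = cut
    return splits


def to_ranges(splits, duration):
    """Turn split times into (start, end) pairs covering the whole file."""
    edges = [0.0] + list(splits) + [duration]
    return list(zip(edges[:-1], edges[1:]))
```

Cutting at the latest quiet gap that still fits means chunk boundaries fall between words where possible, which is what keeps the stitched text seamless. The hard cut is only a fallback for long stretches without a pause.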
This way I can process one hour of audio in around one minute.
I use the gpt-4o-mini-transcribe model, and it works brilliantly.
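For the parallel transcription and stitching part, a minimal sketch might look like the following. The helper names are hypothetical, and it assumes the official `openai` Python package is installed with `OPENAI_API_KEY` set in the environment. `pool.map` returns results in input order, so the text comes back in the same order as the chunks even when they finish at different times.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_chunks(paths, transcribe_one, max_workers=4):
    """Transcribe chunk files in parallel and join the text in chunk order.

    transcribe_one: any callable taking a file path and returning text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of `paths`, not completion order.
        texts = list(pool.map(transcribe_one, paths))
    return " ".join(t.strip() for t in texts if t.strip())


def transcribe_with_openai(path, model="gpt-4o-mini-transcribe"):
    """One possible transcribe_one (assumes the `openai` package is available)."""
    from openai import OpenAI  # imported lazily so the helper above has no hard dependency

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

Usage would be something like `transcribe_chunks(chunk_paths, transcribe_with_openai)`, where `chunk_paths` is the ordered list of chunk files produced by the split step.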