Best practice for generating transcriptions from long audio files

I need to transcribe audio files of up to three hours in length. Should I wait for audio support to be added to the GPT-4o API, or simply use Whisper speech-to-text and then clean up the output with GPT?

Either way, the audio file needs to be split into smaller chunks. The question is: how do I seamlessly stitch the resulting text chunks back together?

Thanks

How I do it:

  • (optional) convert everything to WAV first for faster processing
  • detect noise levels and gaps (silences) in the recording to find candidate split points
  • pick the split points so that no chunk exceeds the maximum chunk size
  • split the source audio file into chunks according to that mapping
  • convert each chunk to 48K mono OGG for faster processing
  • transcribe each chunk (I do this in parallel to speed things up)
  • concatenate the transcripts back together in their original order
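
A minimal sketch of the split-point step, to make the idea concrete. It assumes the silences have already been detected (for example with ffmpeg's silencedetect filter, which is only a suggestion here) and are given as (start, end) pairs in seconds. The name plan_chunks and the greedy "cut at the latest silence that still fits" rule are illustrative choices, not necessarily the exact logic described above:

```python
from bisect import bisect_right


def plan_chunks(silences, total, max_chunk):
    """Return a list of (start, end) chunk boundaries in seconds.

    silences  -- iterable of (start, end) silent gaps, in seconds
    total     -- total duration of the recording, in seconds
    max_chunk -- maximum allowed chunk length, in seconds
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")

    # Candidate cut positions: the middle of each silent gap.
    candidates = sorted((s + e) / 2 for s, e in silences)

    chunks = []
    start = 0.0
    while total - start > max_chunk:
        limit = start + max_chunk
        # Index of the first candidate strictly after the limit.
        i = bisect_right(candidates, limit)
        if i > 0 and candidates[i - 1] > start:
            # Latest silence that still keeps the chunk within the limit.
            cut = candidates[i - 1]
        else:
            # No usable silence in range: fall back to a hard cut.
            cut = limit
        chunks.append((start, cut))
        start = cut
    chunks.append((start, total))
    return chunks


if __name__ == "__main__":
    gaps = [(9.0, 11.0), (24.0, 26.0), (49.0, 51.0), (79.0, 81.0)]
    for begin, end in plan_chunks(gaps, total=100, max_chunk=30):
        print(f"{begin:7.1f} -> {end:7.1f}")
```

Cutting in the middle of a silence means no word is split across two chunks, which is what makes plain concatenation of the transcripts work afterwards. The hard-cut fallback can still split a word when a stretch of speech has no pause shorter than the limit.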

This way I can process one hour of audio in around one minute.

I use the gpt-4o-mini-transcribe model, and it works brilliantly.
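
A rough sketch of the parallel transcription and stitching steps, assuming the chunks already exist on disk as separate files. The helper names transcribe_chunks and openai_transcribe are made up for illustration, and the worker count of 4 is an arbitrary default. The API call assumes the official openai Python package and an OPENAI_API_KEY in the environment:

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_chunks(chunk_paths, transcribe_one, max_workers=4):
    """Transcribe chunks in parallel and join the text in input order.

    transcribe_one is any callable that maps a chunk path to its text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of chunk_paths,
        # even if later chunks finish first.
        texts = list(pool.map(transcribe_one, chunk_paths))
    # Drop empty results and normalise whitespace at the seams.
    return " ".join(t.strip() for t in texts if t.strip())


def openai_transcribe(path, model="gpt-4o-mini-transcribe"):
    """Transcribe one audio file with the OpenAI API (hypothetical helper)."""
    from openai import OpenAI  # imported lazily so the rest runs without it

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text
```

Usage would look something like transcribe_chunks(["chunk_000.ogg", "chunk_001.ogg"], openai_transcribe), where the file names are placeholders. Threads are enough here because each worker spends its time waiting on the network rather than on the CPU.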