Best practice for generating transcriptions from long audio files

I need to transcribe audio files up to three hours long. Should I wait for audio support to be added to the GPT-4o API, or simply use Whisper for speech-to-text and then clean up the output with GPT?

Either way, the audio file needs to be split into smaller chunks. The question is: how can the resulting text chunks be stitched together seamlessly?
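
To make the stitching question concrete, here is a rough sketch of one possible approach, not a known-good recipe. It assumes the audio is cut into windows that overlap by a few seconds, so that neighbouring transcripts repeat some words at the boundary, and it merges them by looking for the longest run of words shared by the end of one chunk and the start of the next. The function names (`chunk_windows`, `stitch`, `stitch_all`) and the default values (10-minute chunks, 15-second overlap) are made up for illustration; the actual audio cutting and transcription calls are left out.

```python
import re


def chunk_windows(total_s, chunk_s=600.0, overlap_s=15.0):
    """Return (start, end) times in seconds for overlapping audio windows."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        # Step back so the next window repeats the tail of this one.
        start = end - overlap_s
    return windows


def _norm(word):
    # Compare words case-insensitively and ignore punctuation.
    return re.sub(r"[^\w']", "", word).lower()


def stitch(prev, nxt, min_overlap=2, max_overlap=50):
    """Join two transcripts, dropping words repeated at the boundary.

    Looks for the longest suffix of `prev` that equals a prefix of `nxt`
    (word by word, after normalisation). If no overlap of at least
    `min_overlap` words is found, the texts are simply concatenated.
    """
    a, b = prev.split(), nxt.split()
    limit = min(len(a), len(b), max_overlap)
    for n in range(limit, min_overlap - 1, -1):
        if [_norm(w) for w in a[-n:]] == [_norm(w) for w in b[:n]]:
            return " ".join(a + b[n:])
    return " ".join(a + b)


def stitch_all(chunks):
    """Fold a list of chunk transcripts into one text."""
    result = ""
    for chunk in chunks:
        result = stitch(result, chunk) if result else chunk
    return result


if __name__ == "__main__":
    print(chunk_windows(1000.0))
    print(stitch_all([
        "the quick brown fox jumps over",
        "fox jumps over the lazy dog.",
        "lazy dog. And then it slept",
    ]))
```

The exact word match is the weak point: a model may transcribe the same overlapping audio slightly differently in each chunk, or cut a word in half at the edge, in which case no overlap is found and the repeated words stay in the output. Fuzzy matching or aligning on word timestamps would probably be more robust, and a GPT clean-up pass could smooth whatever is left over.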


