Whisper leaves out chunks of speech in longer transcripts

Dear community,

We want to transcribe speech of up to 1 hour. However, when we use the Whisper model large-v2 with the 25 MB upload limit (so we chunk the files), multiple parts of the speech, each 10 to 40 seconds long, are missing from the transcript. We noticed that the length of these missing pieces also depends on the prompt, which is necessarily rather long.
Do you have a solution for that problem?

Best,
Christian

Perhaps you misunderstand “prompt” on the transcription endpoint.

It is not a set of instructions for the AI to follow. It is merely prior text that leads up to where the audio starts, giving the AI stronger hints about how to transcribe the speech.
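For example, a minimal sketch using the current OpenAI Python SDK (the file name and prompt text are placeholders): the prompt is typically just the tail of the previous chunk's transcript.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "speech_chunk_02.mp3" and the prompt text are placeholders: the prompt is
# simply the tail of the previous chunk's transcript, not an instruction.
with open("speech_chunk_02.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="...the last sentence or two of the previous chunk's transcript...",
    )

print(result.text)
```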

By re-encoding the audio to mono with a lower-bitrate, voice-oriented codec, you can significantly increase the length of audio you can send and reduce the network transfer time.
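As a rough sketch, assuming ffmpeg is installed (file names and bitrate are placeholders), mono Opus at a low speech bitrate keeps an hour of audio well under 25 MB:

```python
import subprocess

def reencode_for_upload(src: str, dst: str) -> None:
    """Re-encode to mono Ogg/Opus at a low, speech-oriented bitrate.
    At roughly 24 kbps mono, an hour of speech is on the order of 11 MB."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src,
            "-ac", "1",          # downmix to mono
            "-ar", "16000",      # 16 kHz is enough for speech
            "-c:a", "libopus",   # voice-oriented codec
            "-b:a", "24k",       # low bitrate, still intelligible speech
            dst,
        ],
        check=True,
    )

reencode_for_upload("speech.wav", "speech.ogg")
```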

However, missing sections may persist in places where background music or noise confuses whole internal chunks (the Whisper AI operates on 30-second pieces of audio, and the endpoint itself uses techniques to join the audio).

Silence detection and removal, splitting on silence, and reviewing the word-level timestamps programmatically could let you discover gaps in the transcription, and then re-submit smaller chunks that seem insufficient.
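A minimal sketch of the gap-review step (the word list, tuple layout, and threshold are assumptions; in practice the timestamps would come from a verbose/word-level transcription response for one chunk):

```python
def find_gaps(words, min_gap=5.0):
    """Given word-level timestamps as (word, start, end) tuples in seconds,
    return (gap_start, gap_end) spans longer than min_gap in which nothing
    was transcribed; these are candidates for re-submitting smaller chunks."""
    gaps = []
    for prev, curr in zip(words, words[1:]):
        if curr[1] - prev[2] >= min_gap:  # next word's start minus previous word's end
            gaps.append((prev[2], curr[1]))
    return gaps

# Hypothetical word list for one chunk of audio.
words = [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("again", 14.2, 14.6)]
print(find_gaps(words))  # -> [(0.9, 14.2)]
```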

Thanks for your help!

You are right about the prompt, but changing the prompt still changes the frequency of the gaps.

Would you have a recommendation for input parameters to obtain a more sensitive transcript with fewer gaps? Unfortunately, the chunks that the model omitted had the best audio quality and were clearly understandable.

By testing an earlier version of Whisper (such as 20230124), I noticed that these chunks are not left out of the transcript. As we depend on the word timestamps of the newer versions of Whisper, is there any way to get equally permissive behavior in the newer versions?
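For reference, this is roughly the kind of call we would like to keep using (a minimal sketch assuming the open-source whisper package; the threshold values are illustrative placeholders, not settings we have validated):

```python
import whisper

model = whisper.load_model("large-v2")

result = model.transcribe(
    "speech_chunk_02.mp3",
    word_timestamps=True,              # not available in the 20230124 release
    no_speech_threshold=0.8,           # default 0.6; a window is only dropped when
                                       # its no-speech probability exceeds this value
    logprob_threshold=-2.0,            # default -1.0; a decode whose average logprob
                                       # is above this value rescues the window
    condition_on_previous_text=False,  # keeps earlier text from steering later windows
)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```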