I’m trying to use Whisper to transcribe audio files that contain lots of background noise – mostly forest sounds, birds and crickets – and lots of dead air. The speaker’s audio quality varies: sometimes they are close to the mic and sometimes farther away.
When attempting to use Whisper (at temperature: 0, 0.01, 0.2 …) I mostly get garbage out. It can successfully transcribe a few sentences and then will just barf out hallucinations, often just repeats of previous phrases. It often gets stuck repeating a single word or phrase and then won’t transcribe any other speech at all.
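For anyone trying to reproduce this, here is a minimal sketch of the decoding options in the open-source `whisper` Python package that relate to repetition loops and silent stretches. It assumes that package is installed. The values, the `base` model name and the helper name `transcribe_file` are illustrative assumptions, not the settings used for the results above, and there is no guarantee they fix the problem.

```python
from typing import Any, Dict

# Keyword arguments accepted by whisper's transcribe(). The values are
# illustrative starting points (assumptions), not tuned for any recording.
DECODE_OPTIONS: Dict[str, Any] = {
    # Tried in order: a later temperature is used only when decoding at the
    # previous one fails the compression-ratio or log-probability checks below.
    "temperature": (0.0, 0.2, 0.4),
    # Do not feed the previous window's text back in as a prompt; this makes
    # it less likely that one repeated phrase carries over into later windows.
    "condition_on_previous_text": False,
    # A segment is treated as silence when its no-speech probability is above
    # this value and its average log probability is below logprob_threshold.
    "no_speech_threshold": 0.6,
    # Very repetitive output compresses well; above this gzip compression
    # ratio the decoding counts as failed.
    "compression_ratio_threshold": 2.4,
    # Decoding also counts as failed when the average log probability
    # of the tokens falls below this value.
    "logprob_threshold": -1.0,
}


def transcribe_file(path: str, model_name: str = "base") -> str:
    """Hypothetical helper: transcribe one file with the options above."""
    # Imported lazily so the options can be inspected without the package.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **DECODE_OPTIONS)
    return result["text"]
```

Calling `transcribe_file("recording.wav")` (a placeholder file name) returns the full transcript as one string; the per-segment timestamps are in `result["segments"]` if you want to see where the repeats start.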
I thought this might be an audio quality issue, so I used ffmpeg to clean up the files: I normalized the speech volume and then ran an RNN denoiser to remove most non-speech noise. To a human listener the cleaned files sound pristine. It barely helped.
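The cleanup step was roughly of this shape. This is a sketch only: it assumes ffmpeg's `speechnorm` and `arnndn` filters stand in for the normalization and RNN denoising, and that a trained `.rnnn` model file is available. The function name, file paths, and the 16 kHz mono output are assumptions for illustration, not the exact command that was run.

```python
from typing import List


def build_cleanup_command(src: str, dst: str, rnn_model: str) -> List[str]:
    """Hypothetical helper: build an ffmpeg command that normalizes, then denoises."""
    filters = ",".join([
        "speechnorm",             # even out the speech volume
        f"arnndn=m={rnn_model}",  # RNN-based denoiser; needs a model file
    ])
    return [
        "ffmpeg",
        "-i", src,
        "-af", filters,  # audio filter chain, applied left to right
        "-ar", "16000",  # resample to 16 kHz (assumed output format)
        "-ac", "1",      # downmix to mono
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell-quoting problems with file names that contain spaces.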
Is there something I’m doing wrong, or is transcription technology only suited to high-quality, consistent audio? I played with other models, which performed far worse. I’m really disappointed that the first helpful use case I’ve had for these models is a non-starter, and I’m hoping there is something I’m missing here. Is there something open source I can use that has better configuration options?
[Edit: Have figured out some workarounds using different models, but wish there was better accuracy for my input type.]