Whisper hallucinations + dropped sentences: Help?

I’m trying to use Whisper to transcribe audio files that contain lots of background noise (mostly forest sounds, birds and crickets) and lots of dead air. The speaker’s audio quality varies: sometimes they are close to the mic and sometimes farther away.

When attempting to use Whisper (at temperature: 0, 0.01, 0.2 …) I mostly get garbage out. It can successfully transcribe a few sentences and then will just barf out hallucinations, often just repeats of previous phrases. It often gets stuck repeating a single word or phrase and then won’t transcribe any other speech at all.
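For anyone hitting the same repetition loops, here is a minimal sketch of the settings that seem most relevant, assuming the open-source `openai-whisper` Python package. The helper names (`build_transcribe_options`, `drop_repeated_segments`, `transcribe_file`) and the `"small"` model size are made up for illustration. Two things to note: a single temperature value leaves the decoder nothing to fall back to when a window looks degenerate, so the sketch passes a tuple of temperatures, and `condition_on_previous_text=False` stops one bad window from being fed into the next as a prompt. The threshold values shown are the package defaults, not tuned recommendations.

```python
def build_transcribe_options():
    """Keyword arguments for openai-whisper's transcribe() (illustrative helper)."""
    return {
        # A tuple enables temperature fallback: a window is re-decoded at the
        # next temperature when it fails the two checks below.
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        # Do not feed the previous window's text back in as a prompt, so a
        # repeated phrase is less likely to carry over into later windows.
        "condition_on_previous_text": False,
        # Highly compressible output usually means repeated text.
        "compression_ratio_threshold": 2.4,
        # Low average log-probability marks a window as a failed decode.
        "logprob_threshold": -1.0,
        # Used together with logprob_threshold to skip windows judged silent.
        "no_speech_threshold": 0.6,
    }


def drop_repeated_segments(segments):
    """Drop segments whose text repeats the segment immediately before it."""
    kept = []
    previous = None
    for segment in segments:
        normalized = segment["text"].strip().lower()
        if normalized and normalized == previous:
            continue  # consecutive duplicate, likely a hallucinated repeat
        kept.append(segment)
        previous = normalized
    return kept


def transcribe_file(path, model_name="small"):
    """Transcribe one file; needs the openai-whisper package and ffmpeg installed."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **build_transcribe_options())
    segments = drop_repeated_segments(result["segments"])
    return " ".join(segment["text"].strip() for segment in segments)
```

The post-filter only removes back-to-back duplicates, so it will also drop a phrase the speaker genuinely said twice in a row. It is a cleanup step, not a fix for the underlying decoding problem.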

I thought this might be an audio quality issue, so I used ffmpeg to clean up the files: I normalized the speech volume and then ran an RNN denoiser to remove most non-speech noise. To a human listener the audio files sound pristine. It barely helped.
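A cleanup pass along those lines can be sketched roughly as follows. This is an assumption about the pipeline, not the exact command used: it builds an ffmpeg invocation with the `loudnorm` filter for normalization followed by the `arnndn` filter for RNN noise suppression. `arnndn` needs a separately downloaded model file, so the `model.rnnn` path and the function name `build_cleanup_command` are placeholders.

```python
import subprocess


def build_cleanup_command(src, dst, rnn_model="model.rnnn"):
    """Build an ffmpeg command: loudness normalization, then RNN denoising."""
    filters = ",".join([
        "loudnorm",                # EBU R128 loudness normalization
        f"arnndn=m={rnn_model}",   # RNN noise suppression (placeholder model path)
    ])
    return ["ffmpeg", "-y", "-i", src, "-af", filters, dst]


def clean_file(src, dst, rnn_model="model.rnnn"):
    """Run the cleanup; requires ffmpeg on PATH and a real arnndn model file."""
    subprocess.run(build_cleanup_command(src, dst, rnn_model), check=True)
```

Building the argument list separately from running it makes the filter chain easy to inspect or reorder before touching any files.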

Is there something I’m doing wrong, or is transcription technology only suited to high-quality, consistent audio? I played with other models, which performed far worse. I’m really disappointed that the first helpful use-case I’ve had for these models is a non-starter, and I’m hoping there might be something that I’m missing here. Is there something open source I can use that has better configuration options?

[Edit: Have figured out some workarounds using different models, but wish there was better accuracy for my input type.]

Whisper really lags massively behind in terms of “AI”. It isn’t as smart as OpenAI’s GPT-3.5 or GPT-4 models.

I’m looking forward to a real model that can take a system prompt and then follow it.

One can dream.