Whisper hallucinations + dropped sentences: Help?

I’m trying to use Whisper to transcribe audio files that contain lots of background noise – mostly forest sounds, birds and crickets – and lots of dead air. The audio quality of the speaker varies: sometimes they are close to the mic and sometimes further away.

When attempting to use Whisper (at temperature: 0, 0.01, 0.2 …) I mostly get garbage out. It can successfully transcribe a few sentences and then will just barf out hallucinations, often just repeats of previous phrases. It often gets stuck repeating a single word or phrase and then won’t transcribe any other speech at all.
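One way to at least flag this kind of loop after the fact is to check how well the transcript compresses, since text that repeats the same phrase compresses far better than normal speech. This is only a rough sketch: the 2.4 cutoff and the helper names `looks_looped` and `collapse_repeats` are assumptions for illustration, not part of any API.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_looped(text: str, threshold: float = 2.4) -> bool:
    """Heuristic flag for repetition loops (the threshold is an assumed cutoff)."""
    return compression_ratio(text) > threshold


def collapse_repeats(segments: list[str]) -> list[str]:
    """Drop consecutive segments that repeat the previous one (case-insensitive)."""
    kept: list[str] = []
    for segment in segments:
        if kept and kept[-1].strip().lower() == segment.strip().lower():
            continue
        kept.append(segment)
    return kept
```

Short transcripts carry compression overhead and will rarely trip the flag, so this is more useful on whole chunks than on single sentences.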

I thought this might be an audio quality issue, so I used ffmpeg to clean up the files: I normalized the speech volume and then ran an RNN denoiser to remove most non-speech noise. To a human listener the audio files sound pristine. It barely helped.
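For anyone wanting to reproduce a cleanup pass like this, here is a minimal sketch. The exact filters are a guess: it assumes ffmpeg's `speechnorm` filter for volume and its `arnndn` filter for RNN denoising, and `arnndn` needs a separately downloaded model file (the `rnn_model_path` argument here is a placeholder). The 16 kHz mono output is also just an assumption.

```python
import subprocess


def build_cleanup_command(src: str, dst: str, rnn_model_path: str) -> list[str]:
    """Build an ffmpeg command: normalize speech volume, then RNN-denoise."""
    # Assumed filter chain; the original post does not say which filters were used.
    filters = ",".join(["speechnorm", f"arnndn=m={rnn_model_path}"])
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-af", filters,
        "-ac", "1",        # downmix to mono
        "-ar", "16000",    # resample to 16 kHz
        dst,
    ]


def clean_audio(src: str, dst: str, rnn_model_path: str) -> None:
    """Run the cleanup; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_cleanup_command(src, dst, rnn_model_path), check=True)
```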

Is there something I’m doing wrong or is the technology for transcription only adapted for high-quality, consistent audio? I played with other models which performed far worse. I’m really disappointed that the first helpful use-case I’ve had for these models is a non-starter, and I’m hoping there might be something that I’m missing here. Is there something open source I can use that has better configuration options?
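On the configuration question: the open-source `openai-whisper` Python package exposes several decoding options on `transcribe()` that relate to repetition loops and silence. The sketch below assumes that package is installed and ffmpeg is on the PATH; the specific values and the `small` model are illustrative choices, not recommendations tested on this kind of audio.

```python
# Options passed to whisper's transcribe(); values here are illustrative.
DECODE_OPTIONS = {
    # Try greedy decoding first, then retry at higher temperatures on failure.
    "temperature": (0.0, 0.2, 0.4),
    # Don't feed the previous window's text back in as a prompt; this can make
    # the model less prone to getting stuck repeating earlier phrases.
    "condition_on_previous_text": False,
    # Treat a window as silence when the no-speech probability exceeds this
    # and the average log-probability is below logprob_threshold.
    "no_speech_threshold": 0.6,
    # Treat output as failed if it is too repetitive (compresses too well).
    "compression_ratio_threshold": 2.4,
    # Treat output as failed if the average log-probability is too low.
    "logprob_threshold": -1.0,
}


def transcribe(path: str, model_name: str = "small") -> str:
    """Transcribe one file with the options above and return the text."""
    import whisper  # imported here so the options can be inspected without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **DECODE_OPTIONS)
    return result["text"]
```

Turning off `condition_on_previous_text` can make the text less consistent between windows, so it is a trade-off rather than a fix.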

[Edit: Have figured out some workarounds using different models, but wish there was better accuracy for my input type.]

Whisper really lags massively behind in terms of “AI”. It isn’t as smart as their GPT-3.5 or 4 counterparts.

I’m looking forward to a real model that can take a system prompt and then follow it.

One can dream.


Try AssemblyAI. I got much better results with it than with OpenAI’s Whisper on our website, AI.OpenSubtitles.com. We have turned off the Whisper option for now because of the number of complaints we were getting, and we are waiting for OpenAI’s customer support to fix the problems before enabling it again.

Without training a new model, I think that will be difficult. There’s not much they can do to fix this in the current version.