Has anyone experienced Whisper hallucinating on empty sections? In my case, I’m dealing with audio/video in Indonesian, and usually when there’s a silent section at the beginning or the end, Whisper fills it in with something like “thanks for watching” or “sub by x”. Is there any way to prevent this? Maybe with a VAD filter?
I had the same issue, but with the regular text model. It tends to do that when it gets the feeling that the provided content is not finished. What I did was append a string to the API request stating that the content is complete. Maybe you can do something similar with Whisper?
I took a glance at the documentation, and it states that you can use an optional prompt to guide the way the model replies. There you can describe the behavior you want to see.
Refer to the Link below (Documentation → Audio)
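For what it's worth, here is a minimal sketch of passing that optional prompt from Python, assuming the official openai package and the whisper-1 model. The helper names and the prompt text are made up for illustration, and I can't promise the prompt actually suppresses the hallucinated lines; as far as I understand it acts more as a style hint than as an instruction.

```python
from pathlib import Path


def build_transcription_kwargs(language="id", prompt=None):
    """Collect the optional arguments for a transcription request.

    "id" is the ISO 639-1 code for Indonesian. The prompt is optional and
    is only added when one is given.
    """
    kwargs = {"model": "whisper-1", "language": language}
    if prompt:
        kwargs["prompt"] = prompt
    return kwargs


def transcribe(path, **kwargs):
    """Send one audio file to the transcription endpoint."""
    # Imported lazily so the helper above works without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with Path(path).open("rb") as audio:
        return client.audio.transcriptions.create(file=audio, **kwargs)
```

Usage would be something like transcribe("clip.mp3", **build_transcription_kwargs(prompt="Transkrip percakapan dalam bahasa Indonesia.")), where both the file name and the prompt are placeholders.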
I think any sort of threshold would be a great way to filter out any noise caused by … no noise …
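One way to apply a threshold after the fact, as a rough sketch: when Whisper returns per-segment details (the verbose_json response format of the API, or the segments list of the open-source package), each segment carries a no_speech_prob value. The function name, the 0.6 cutoff, and the sample values below are my own assumptions, so tune them on your own audio.

```python
def drop_silent_segments(segments, max_no_speech_prob=0.6):
    """Keep only segments the model itself considers likely to contain speech.

    `segments` is a list of dicts with at least "text" and "no_speech_prob".
    The 0.6 default is only a starting guess, not a recommended value.
    """
    return [seg for seg in segments if seg["no_speech_prob"] <= max_no_speech_prob]


# Hypothetical segments, shaped like Whisper's per-segment output.
example = [
    {"text": "Selamat pagi semuanya.", "no_speech_prob": 0.02},
    {"text": "Thanks for watching!", "no_speech_prob": 0.91},
]

kept = drop_silent_segments(example)
print(" ".join(seg["text"] for seg in kept))
```

This only helps when the hallucinated segment is also flagged as probable silence; a hallucination with a low no_speech_prob will still get through.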
I have had some success fighting this issue by processing the file through ffmpeg with a silenceremove filter before sending it to Whisper. Something like this:
ffmpeg -fflags +discardcorrupt -y -i <file_name> -ar 8000 -af silenceremove=start_periods=1:stop_periods=-1:start_threshold=-30dB:stop_threshold=-30dB:start_silence=2:stop_silence=2 <output_file>
You would probably change -ar (the sample rate) and some of the silenceremove flags depending on your audio; for that you can refer to this page.
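If you want to run that preprocessing step from a script, here is a small sketch that builds the same ffmpeg call with subprocess. The function names and parameters are hypothetical, the defaults just mirror the command above, and it assumes ffmpeg is on your PATH.

```python
import subprocess


def build_silenceremove_cmd(src, dst, sample_rate=8000, threshold_db=-30, min_silence_s=2):
    """Build the ffmpeg argument list for stripping silence from `src`."""
    audio_filter = (
        "silenceremove="
        "start_periods=1:stop_periods=-1:"
        f"start_threshold={threshold_db}dB:stop_threshold={threshold_db}dB:"
        f"start_silence={min_silence_s}:stop_silence={min_silence_s}"
    )
    return [
        "ffmpeg",
        "-fflags", "+discardcorrupt",  # ffmpeg options take a single dash
        "-y",                          # overwrite the output file if it exists
        "-i", str(src),
        "-ar", str(sample_rate),
        "-af", audio_filter,
        str(dst),                      # ffmpeg needs an explicit output path
    ]


def strip_silence(src, dst, **options):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_silenceremove_cmd(src, dst, **options), check=True)
```

For example, strip_silence("input.mp4", "trimmed.wav", sample_rate=16000) with placeholder file names. Keep in mind that removing silence shifts the timeline, so timestamps from the trimmed file will no longer line up with the original video.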