Thanks for reading through!
These are interesting points and questions, to which I have answers.
Levels might be the way? Check for silent clips and ignore them. It feels like Whisper needs a main voice to latch on to, or it starts transcribing digital noise.
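For reference, the level check I had in mind is just an RMS threshold over the raw samples, something like this (the -45 dBFS cutoff and 16-bit scale are made-up example values, not anything Whisper uses):

```python
import math

def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of 16-bit PCM samples in dBFS (0 dB = full scale)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / full_scale)

def looks_silent(samples, threshold_db=-45.0):
    """Skip clips whose level falls below the (arbitrary) threshold."""
    return rms_dbfs(samples) < threshold_db

# Digital silence vs. a quiet-but-present signal:
print(looks_silent([0] * 16000))           # True
print(looks_silent([2000, -2000] * 8000))  # False
```

The catch, as below, is that footsteps and room noise have plenty of level, so this only catches truly silent clips.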
I gave it a clip of me walking loudly while Whisper was configured for “pt”, and it transcribed:
Legendas pela comunidade do Amara.org
That translates to “Subtitles by the Amara.org community”, which is another hallucination.
EDIT: Looks like this is an ongoing discussion: Dataset bias ("❤️ Translated by Amara.org Community") · openai/whisper · Discussion #928 · GitHub
So yeah, levels flew out of the window too.
You would need some sort of LLM to tell you whether there is voice or not. Then it might work.
Or simply a successor to the Whisper model that correctly outputs “(Silence)” or something along those lines when the audio is quiet.
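Until a model does that itself, the workaround would be gating the transcription behind a voice-activity check. A rough sketch of the idea, where `vad` stands in for whatever speech detector you pick and `transcribe` stands in for your Whisper call (both are placeholders, not real APIs):

```python
def transcribe_gated(chunks, vad, transcribe):
    """Return per-chunk text, emitting "(Silence)" for non-speech chunks
    instead of letting the model hallucinate on them."""
    out = []
    for chunk in chunks:
        if vad(chunk):
            out.append(transcribe(chunk))
        else:
            out.append("(Silence)")
    return out

# Toy demo with stubs standing in for a real VAD and a real ASR model:
fake_vad = lambda chunk: chunk == "speech"
fake_asr = lambda chunk: "hello world"
print(transcribe_gated(["speech", "noise"], fake_vad, fake_asr))
# ['hello world', '(Silence)']
```

Whether the gate is an LLM or a small dedicated VAD model, the structure is the same: never hand the transcriber audio it has already judged to be voiceless.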
If you fed each frame individually and asked ‘is there an air conditioner in this image?’ it might shed some light on what’s happening.
This would pretty much work; however, the question is heavily dependent on the video. You would need a step that asks all sorts of questions, which gets inaccurate pretty fast.
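To make the per-frame idea concrete, the loop would look something like this, where `ask` is a hypothetical stand-in for a vision-language model call that answers a yes/no question about one frame:

```python
def probe_frames(frames, question, ask):
    """Ask the same yes/no question of every frame and return the
    fraction of frames where the model answers yes."""
    votes = [ask(frame, question) for frame in frames]
    return sum(votes) / len(votes) if votes else 0.0

# Toy demo with a stub model that "sees" the object in two of four frames:
frames = ["f1", "f2", "f3", "f4"]
stub_model = lambda frame, question: frame in ("f2", "f3")
print(probe_frames(frames, "is there an air conditioner?", stub_model))
# 0.5
```

The weak point is exactly the one above: you have to know which questions to ask, and a battery of generic questions per frame gets expensive and unreliable quickly.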