I have an audio recording that contains no human speech; it’s actually the audio track of a video in which a woman is cleaning her kitchen. Surprisingly, the OpenAI audio transcription API produces a hallucinated transcription in Korean.
I was expecting the Whisper API to return an empty transcription for such audio, because I’m developing an application that must handle audio both with and without speech.
Any suggestions on how to overcome this problem?
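One workaround, if you request `response_format="verbose_json"`, is to filter on the per-segment confidence fields the API returns. A minimal sketch of that post-filter — the `no_speech_prob`/`avg_logprob` heuristic mirrors the one in the open-source Whisper decoder, and the threshold values here are guesses you would tune on your own data:

```python
# Sketch: drop Whisper segments that the model itself flags as likely
# non-speech. Assumes segments from response_format="verbose_json",
# each carrying "no_speech_prob", "avg_logprob", and "text" fields.
# Thresholds (0.6 / -1.0) follow the defaults in the open-source
# Whisper decoder but should be tuned for your audio.
def filter_hallucinations(segments, no_speech_max=0.6, logprob_min=-1.0):
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > no_speech_max and seg["avg_logprob"] < logprob_min:
            continue  # high no-speech probability AND low confidence: likely hallucinated
        kept.append(seg["text"])
    return " ".join(kept).strip()


# Usage with dummy segment data shaped like the verbose_json response:
segments = [
    {"no_speech_prob": 0.92, "avg_logprob": -1.6, "text": "hallucinated text"},
    {"no_speech_prob": 0.05, "avg_logprob": -0.3, "text": "real speech"},
]
print(filter_hallucinations(segments))  # → "real speech"
```

This doesn’t stop the model from hallucinating, but it gives your application a principled way to discard segments the model itself considered likely non-speech.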
You might look into the work Nvidia did with the RTX series cards on detecting and isolating speech; it’s actually a non-trivial problem. AIs will always try to find the best match for the given input, and unless that input is pure silence there is always some probability of a false detection.
When I input a short English speech WAV to the OpenAI Whisper API, it occasionally returns a Korean translation of my English speech, though the content and meaning seem mostly correct. Right meaning, wrong language.
I am not a Korean speaker, so I don’t think any of my settings point to Korean. What could be wrong?
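If you know the input is always English, you can pin the decode language instead of letting Whisper auto-detect it, via the transcription endpoint’s `language` parameter (an ISO-639-1 code). A minimal sketch, assuming the official `openai` Python client — the helper just builds the keyword arguments so the call itself stays explicit:

```python
# Sketch: build kwargs for client.audio.transcriptions.create so the
# decode language is pinned to English rather than auto-detected.
# "language" takes an ISO-639-1 code; "prompt" is an optional nudge
# toward English output. The file name and prompt text are examples.
def transcription_kwargs(file_obj, language="en"):
    return {
        "model": "whisper-1",
        "file": file_obj,
        "language": language,  # disable language auto-detection
        "prompt": "The following is English speech.",
    }


# Usage (requires the `openai` package and an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   with open("speech.wav", "rb") as f:
#       transcript = client.audio.transcriptions.create(**transcription_kwargs(f))
#   print(transcript.text)
```

With `language` unset, Whisper guesses the language from the first seconds of audio, and a wrong guess can flip the whole output into another language while preserving the meaning, which matches the symptom you describe.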