This episode is actually a co-production with another podcast called Digital Folklore, which is hosted by Mason Amadeus and Perry Carpenter. We’ve been doing a lot of our research together and our brainstorming sessions have been so thought-provoking, I wanted to bring them on so we could discuss the genre of analog horror together. So, why don’t you guys introduce yourselves so we know who’s who? Yeah, this is Perry Carpenter and I’m one of the hosts of Digital Folklore. And I’m Mason Amadeus and I’m the other host of Digital Folklore. And tell me, what is Digital Folklore? Yeah, so Digital Folklore is the evolution of folklore, you know, the way that we typically think about it. And folklore really is the product of basically anything that humans create that doesn’t have a centralized canon. But when we talk about digital folklore, we’re talking about…
I have very little experience with Whisper but I have noticed a lot of people complaining about it hallucinating when there is little to no audio.
The only solution I can think of is to also measure the level of the current audio, and then use it as a filter on the end product. If there is very little signal, it is a fair assumption that there is nothing to transcribe, and that timeframe can be patched out.
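That level check could be as simple as an RMS threshold over fixed-size chunks. A minimal pure-Python sketch, assuming raw PCM sample values; the function names and the threshold are my own placeholders to tune against real recordings:

```python
import math

def rms(samples):
    """Root-mean-square level of a chunk of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_silent(samples, threshold=100.0):
    """True if the chunk's RMS level falls below the threshold.

    The threshold is an arbitrary placeholder; 16-bit PCM samples
    range from -32768 to 32767, so tune it for your recordings.
    """
    return rms(samples) < threshold

def keep_audible_chunks(chunks, threshold=100.0):
    """Filter out near-silent chunks before sending audio to Whisper."""
    return [c for c in chunks if not is_silent(c, threshold)]
```

That way the silent stretches get patched out before transcription, rather than trying to clean up hallucinated text afterwards.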
There are no doubts regarding the quality of your recording.
However, it cuts off abruptly, and I have noticed that the transcriptions are hallucinated when there is no audio. I imagine you are trying to automate the transcription of these podcasts? Although I have very little experience with Whisper, my experience with GPT is that it will “try to make sense of nonsense”, which in this case is a sentence that is unfinished.
A couple seconds to verify the text certainly isn’t the end of the world…
You could maybe just take into consideration that it may hallucinate some words if the recording cuts off abruptly. Again, going back to monitoring the current level of the audio.
Maybe it is a safe assumption to say “If audio cuts abruptly → The last sentence may be corrupted”
In your example the last sentence is incomplete, why not just add a filter to check if the last sentence is complete or not?
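Such a filter could be as simple as checking whether the transcript ends in terminal punctuation and, if not, dropping the trailing fragment. A rough sketch (splitting sentences on “.”, “!”, “?” is naive, but often enough for this purpose):

```python
import re

TERMINAL = (".", "!", "?", "…")

def drop_incomplete_tail(transcript):
    """Strip a trailing sentence fragment that is likely cut off
    or hallucinated, judged by missing terminal punctuation."""
    text = transcript.rstrip()
    if text.endswith(TERMINAL):
        return text
    # Keep everything up to and including the last complete sentence.
    parts = re.split(r"(?<=[.!?…])\s+", text)
    return " ".join(parts[:-1]) if len(parts) > 1 else ""
```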
In my experience with Whisper, it has the lowest transcription error rate, but isn’t perfect. If you use alternatives like AWS Transcribe, you get a higher error rate, but it will at least separate out different speakers for you.
AFAIK, the only way to “prevent hallucinations” is to coach Whisper with the prompt parameter. Otherwise, expect it, and just about everything else, to not be 100% perfect.
But in my business, we switched to Whisper API on OpenAI (from Whisper on Huggingface and originally from AWS Transcribe), and aren’t looking back!
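For reference, a hedged sketch of how the prompt (and optionally the language) might be passed to the hosted Whisper endpoint. The helper and the prompt text are my own; the commented-out call assumes the current openai Python SDK, where Whisper is exposed as the “whisper-1” model:

```python
def build_transcription_kwargs(prompt=None, language=None):
    """Collect the optional Whisper parameters alongside the model name."""
    kwargs = {"model": "whisper-1"}
    if prompt:
        # The prompt can bias Whisper's style and vocabulary;
        # it is a hint, not a hard instruction.
        kwargs["prompt"] = prompt
    if language:
        # ISO-639-1 code of the audio language, e.g. "en".
        kwargs["language"] = language
    return kwargs

# Usage (requires the openai package and OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# with open("episode.mp3", "rb") as f:
#     result = client.audio.transcriptions.create(
#         file=f,
#         **build_transcription_kwargs(
#             prompt="A podcast interview about digital folklore.",
#             language="en",
#         ),
#     )
```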
If you are planning on commercializing Whisper, this seems like a perfect opportunity to put yourself in a better position than your competitors. Rather than placing a warning, I truly believe you can prevent this issue from occurring with just a little bit of elbow grease.
Usually, these kinds of features can be expanded as well. If you are monitoring the level of the audio, you can display it the way WhatsApp and other messaging apps do when you create a voice note.
It would be very easy to anticipate a hallucinated ending based on the audio sample you have shown.
I am using ffmpeg to split files, but I don’t think it can recognize pauses for that. pydub sounds good, I will check it out.
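For what it's worth, ffmpeg does ship a silencedetect filter (e.g. `ffmpeg -i in.mp3 -af silencedetect=noise=-35dB:d=1 -f null -`, where the noise floor and minimum duration are values to tune), which logs `silence_start`/`silence_end` lines to stderr. A small sketch that parses that log into (start, end) pairs you could split on:

```python
import re

def parse_silences(ffmpeg_stderr):
    """Parse ffmpeg silencedetect stderr output into (start, end) pairs
    of silence intervals, in seconds."""
    starts = [float(m) for m in re.findall(r"silence_start: ([\d.]+)", ffmpeg_stderr)]
    ends = [float(m) for m in re.findall(r"silence_end: ([\d.]+)", ffmpeg_stderr)]
    return list(zip(starts, ends))
```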
How much are you using the prompt to give instructions, and how well does it obey? I have just recently started sending the language of the audio file with it; I am not sure it helps. Also, it could create a problem if some sentences in other languages were mixed in; I am not sure how that would work.
I am now getting fantastic results using prompts like the following:

prompt = (None, "You are a British speaker, please transcribe this into English for me. "
    "This will never be in Welsh. "
    "Do not remove punctuation words like 'dash' or 'new paragraph'.")
My issue was Whisper removing punctuation words, which I process separately using Python code and also the GPT-4 API.
Whisper itself did a crazily good thing last week. My user recorded a letter and finished with “Best wishes”. She then said, “oh sorry, add ‘thank you for coming to see me’ before ‘best wishes’”. Whisper transcribed this correctly, without transcribing the “oh sorry” part. I couldn’t believe it!