This episode is actually a co-production with another podcast called Digital Folklore, which is hosted by Mason Amadeus and Perry Carpenter. We’ve been doing a lot of our research together and our brainstorming sessions have been so thought-provoking, I wanted to bring them on so we could discuss the genre of analog horror together. So, why don’t you guys introduce yourselves so we know who’s who? Yeah, this is Perry Carpenter and I’m one of the hosts of Digital Folklore. And I’m Mason Amadeus and I’m the other host of Digital Folklore. And tell me, what is Digital Folklore? Yeah, so Digital Folklore is the evolution of folklore, you know, the way that we typically think about it. And folklore really is the product of basically anything that humans create that doesn’t have a centralized canon. But when we talk about digital folklore, we’re talking about…
I am not 100 percent sure whether this parameter is available in the Whisper API as well, but if it is, you could try turning the temperature parameter down.
I have very little experience with Whisper but I have noticed a lot of people complaining about it hallucinating when there is little to no audio.
The only solution I could think of is to also note the strength of the current audio, and then use it as a filter on the end product. If there’s very little noise, it would be a fair assumption that there’s nothing to record, and that timeframe can be patched out.
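The audio-strength filter described above could look roughly like this. It is only a sketch under a few assumptions: the samples are already decoded to floats in [-1, 1], the transcript comes as segments with `start`/`end` times in seconds, and the 0.5 s window, the 0.01 threshold, and the 80% overlap cutoff are placeholder values to tune. All function names here are made up for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silent_spans(samples, sample_rate, window_s=0.5, threshold=0.01):
    """Return merged (start_s, end_s) spans whose RMS is below threshold."""
    size = max(1, int(window_s * sample_rate))
    spans = []  # kept in sample indices so adjacent windows merge exactly
    for start in range(0, len(samples), size):
        window = samples[start:start + size]
        if rms(window) < threshold:
            end = start + len(window)
            if spans and spans[-1][1] == start:
                spans[-1][1] = end  # extend the previous silent span
            else:
                spans.append([start, end])
    return [(a / sample_rate, b / sample_rate) for a, b in spans]


def drop_silent_segments(segments, spans, min_overlap=0.8):
    """Keep only segments that are not mostly covered by silent spans."""
    kept = []
    for seg in segments:
        length = seg["end"] - seg["start"]
        overlap = sum(
            max(0.0, min(seg["end"], b) - max(seg["start"], a))
            for a, b in spans
        )
        if length <= 0 or overlap / length < min_overlap:
            kept.append(seg)
    return kept
```

Any text that the transcript places almost entirely inside a quiet stretch gets patched out; everything else passes through untouched.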
There are no doubts regarding the quality of your recording.
It cuts off abruptly, however, and I have noticed that transcriptions are hallucinated when there is no noise. I imagine you’re trying to automate the transcription of these podcasts? Although I have very little experience with Whisper, my experience with GPT is that it will “try to make sense of nonsense”, which in this case is an unfinished sentence.
A couple seconds to verify the text certainly isn’t the end of the world…
You could maybe just take into consideration that it may hallucinate some words if the recording cuts off abruptly. Again, this goes back to monitoring the current strength of the audio.
Maybe it is a safe assumption to say “If audio cuts abruptly → The last sentence may be corrupted”
In your example the last sentence is incomplete. Why not just add a filter to check whether the last sentence is complete?
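A completeness filter like that can be a few lines of string handling. This is a rough heuristic, not a real sentence parser: it assumes a finished sentence ends in `.`, `!` or `?` (optionally followed by a closing quote or bracket) and treats a trailing ellipsis as unfinished. Abbreviations like "e.g." will fool it. The function names are made up for illustration.

```python
import re

CLOSERS = "\"'”’)]"  # characters that may follow the final punctuation mark
_SENTENCE_END = re.compile(r"[.!?][%s]*\s+" % re.escape(CLOSERS))


def last_sentence_is_complete(text: str) -> bool:
    """True if the text ends with sentence-final punctuation."""
    stripped = text.rstrip().rstrip(CLOSERS).rstrip()
    if not stripped or stripped.endswith(("...", "…")):
        return False
    return stripped[-1] in ".!?"


def split_unfinished_tail(text: str):
    """Return (complete_part, unfinished_tail) for a transcript."""
    if last_sentence_is_complete(text):
        return text.strip(), ""
    matches = list(_SENTENCE_END.finditer(text))
    if not matches:
        return "", text.strip()  # no finished sentence at all
    cut = matches[-1].end()
    return text[:cut].strip(), text[cut:].strip()
```

The unfinished tail can then be dropped, or flagged for a quick manual check, instead of being trusted.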
In my experience with Whisper, it has the lowest transcription error rate, but isn’t perfect. If you use alternatives like AWS Transcribe, you get a higher error rate, but it will at least separate out different speakers for you.
AFAIK, the only way to “prevent hallucinations” is to coach Whisper with the prompt parameter. Otherwise, expect it, and just about everything else, to not be 100% perfect.
But in my business, we switched to Whisper API on OpenAI (from Whisper on Huggingface and originally from AWS Transcribe), and aren’t looking back!
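For reference, here is roughly how the `prompt` and `temperature` parameters mentioned in this thread would be passed. This is a sketch, assuming the official `openai` Python client and the `whisper-1` model; `transcribe` is a made-up helper name, and the client is passed in so the helper itself does not depend on how it was configured.

```python
def transcribe(client, audio_file, prompt=None, language=None, temperature=0.0):
    """Send one audio file to the transcription endpoint with conservative settings.

    `client` is assumed to be an `openai.OpenAI()` instance and `audio_file`
    a file opened in binary mode.
    """
    params = {
        "model": "whisper-1",
        "file": audio_file,
        "temperature": temperature,  # lower values make decoding less random
    }
    if prompt:
        params["prompt"] = prompt  # context such as names and spelling
    if language:
        params["language"] = language  # ISO-639-1 code, e.g. "en"
    return client.audio.transcriptions.create(**params)


# Example usage (needs the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# with open("episode.mp3", "rb") as f:
#     result = transcribe(client, f, prompt="Mason Amadeus, Perry Carpenter", language="en")
# print(result.text)
```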
If you are planning on commercializing Whisper, this seems like a perfect opportunity to put yourself in a better position than your competitors. Rather than placing a warning, I truly believe you can prevent this issue from occurring with just a little bit of elbow grease.
Usually, these kinds of features can be expanded as well. If you are monitoring the strength of the audio, you can display it like WhatsApp and other messaging apps do when you create a voice note.
It would be very easy to anticipate a hallucinated ending based on the audio sample you have shown.
No, there will always be transcription errors! I think OpenAI says they expect about 95% accuracy in English, so 5% bad! Still better than the 70% you get everywhere else. AI is 95% perfect, not 100% perfect.
The only thing you can do is detect short files and send an error back to the user if the file is too short.
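The short-file check could be sketched like this with the standard library. It assumes WAV input (other formats would need a decoder first) and a 1-second minimum, which is just a placeholder to tune. `check_long_enough` is a made-up name.

```python
import wave

MIN_SECONDS = 1.0  # assumed cutoff; pick whatever suits your recordings


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_long_enough(source, min_seconds: float = MIN_SECONDS) -> float:
    """Return the duration, or raise ValueError if the recording is too short."""
    duration = wav_duration_seconds(source)
    if duration < min_seconds:
        raise ValueError(
            f"Recording is {duration:.2f}s long; "
            f"at least {min_seconds:.2f}s is required."
        )
    return duration
```

The `ValueError` message can be sent straight back to the user before any API call is made.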
Completely. I was actually blown away when I saw that it’s more accurate with Spanish. Although after some conversing it made complete sense.
What I’m referring to is the hallucinations that occur from either cut audio or moments of silence (which I’ve seen cause Whisper to hallucinate random sentences).
If we are talking about mid-word cutoff hallucinations, then use pydub to segment the audio into chunks under 25 MB without cutting it mid-word (it can do this, not sure of the setting though) before sending it to the API.
Otherwise, get the elbow grease out, and create this yourself, sure.
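For the chunking idea above, the splitting at pauses and the size limit are two separate steps. The sketch below covers only the second one: it assumes the audio has already been cut at pauses (for example with pydub's `split_on_silence`, which is not called here) and greedily groups consecutive pieces so each chunk stays under a limit. `pack_segments` is a made-up name, and the sizes can be bytes or milliseconds as long as the limit uses the same unit.

```python
def pack_segments(sizes, limit):
    """Group consecutive segments into chunks whose total size is <= limit.

    Returns a list of chunks, each a list of segment indices. A single
    segment larger than the limit ends up alone in its own chunk and
    still needs to be handled separately.
    """
    chunks, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > limit:
            chunks.append(current)  # close the chunk at a pause boundary
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever end between segments, no chunk boundary falls mid-word as long as the pause detection did its job.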
I am using ffmpeg to split files, but I don’t think it can recognize pauses for that. pydub sounds good, I will check it out.
How much are you using the prompt to give it instructions, and how well does it obey? I have just recently started sending the language of the audio file with it, but I am not sure it helps. It could also create a problem if some sentences in other languages were mixed in; I am not sure how that would work.
I am now getting fantastic results using prompts like the following:
prompt=(None, "You are a British speaker, please transcribe this into English for me. "
"This will never be in Welsh. "
"Do not remove punctuation words like 'dash' or 'new paragraph'.")
My issue was Whisper removing punctuation words, which I process separately using Python code and also using the GPT-4 API.
Whisper itself did a crazily good thing last week. My user recorded a letter and finished with “Best wishes”. She then said “oh sorry, add before best wishes thank you for coming to see me”. Whisper transcribed this correctly without transcribing the “oh sorry” part. I couldn’t believe it!
I was having very similar issues with cut off sentences. As @curt.kennedy mentioned, you can prompt it to your use case. This prompt worked perfectly for me:
“The sentence may be cut off, do not make up words to fill in the rest of the sentence.”
Does that actually work? I thought the prompt was only to inform/influence the formatting of the text, or inclusion of stop-words and specific words/names.
My issue is currently that the end of the transcription is a massive repetition of a single character which completely floods my token limit. It’s infuriating.
The resulting transcription is so large that it seems to interfere with the system message to GPT-4 when translating. The system message says to omit any needless repetition of words, but that instruction is always ignored.
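One way to keep a runaway ending like that from flooding the token limit is to collapse the repetition in code before the text goes to GPT-4, rather than asking the model to ignore it. A minimal sketch, assuming the flood is either one character repeated or one short token repeated with spaces in between; `collapse_repeats` and `max_run` are made-up names.

```python
import re


def collapse_repeats(text: str, max_run: int = 3) -> str:
    """Cap runs of a repeated character or repeated token at max_run copies."""
    # Pass 1: the same character more than max_run times in a row.
    char_run = re.compile(r"(.)\1{%d,}" % max_run, flags=re.DOTALL)
    text = char_run.sub(lambda m: m.group(1) * max_run, text)

    # Pass 2: the same whitespace-separated token more than max_run times.
    token_run = re.compile(r"(?<!\S)(\S+)(?:\s+\1(?!\S)){%d,}" % max_run)
    return token_run.sub(lambda m: " ".join([m.group(1)] * max_run), text)
```

Ordinary text such as "hello..." is left alone, since nothing in it repeats more than `max_run` times.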