Whisper hallucination - how to recognize and solve it?

Ok, I have been using the Whisper API for some time now. It works very well for big languages and is almost acceptable for small ones.

However, occasionally it hallucinates and, as part of the transcription, sends back repeated words or phrases. Sometimes this is one word repeated many times; other times it is a few words one after the other, repeated again and again (like a repeated phrase).

I am trying to recognize this pattern so that, when it happens, I can send the audio back for transcription again, as the error usually does not repeat for the same audio.

I have tried two approaches:

  1. Send the text to the OpenAI GPT-3.5 API to flag suspicious transcriptions. As I am doing this as part of another step anyway, it would not add cost, but unfortunately it is not able to report back consistently. I instructed it to return “[SUSPICIOUS]” if it believes there could be mistakes, but it does so for perfectly good transcriptions as well.

  2. Recognize repeated words with a Python function. When a single word is repeated it has worked OK, but the problem is with phrases: I would need to cut the text and then detect 2 repeated words, 3 repeated words, 4 repeated words, and so on (a rough sketch of what I mean is below). I am not even sure this will work. Is there a more intelligent way than that?
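Something like this is what I have in mind - just a sketch, where the n-gram sizes and the repetition threshold are guesses that would need tuning:

    # Sketch of a repeated-phrase detector: slide over the transcript and count
    # how many times the same n-gram (1..max_n words) repeats back-to-back.
    # max_n and threshold are arbitrary starting values, not tested settings.
    def find_repetitions(text: str, max_n: int = 4, threshold: int = 4):
        """Return (phrase, count) pairs repeated consecutively >= threshold times."""
        words = text.split()
        hits = []
        for n in range(1, max_n + 1):
            i = 0
            while i + n <= len(words):
                ngram = words[i:i + n]
                count = 1
                j = i + n
                while words[j:j + n] == ngram:
                    count += 1
                    j += n
                if count >= threshold:
                    hits.append((" ".join(ngram), count))
                    i = j
                else:
                    i += 1
        return hits

    # Flags the "uvijek uvijek uvijek ..." pattern from the examples below.
    print(find_repetitions("Uvijek je bilo da je uvijek uvijek uvijek uvijek uvijek"))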

One way would be to use prompting.

Hm, I am not really sure what you mean. Should I tell it not to hallucinate or not to start repeating the same word? I don’t think it is doing it on purpose.

Write a small part of the corrected transcription in the prompt. For example, if you write in English, the model will know that the audio is English.

I understand, but it is not a problem with understanding the language. In most cases it detects the language automatically, but I include it in the transcription request anyway; I directly tell it which language the audio is in.

However, what I am saying is that at some point it will start repeating words out of the blue. I am pasting examples below - and if I re-do these transcriptions, the mistakes don’t repeat.

For example, this text was transcribed from Croatian:

“Uvijek je bilo da je uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uvijek uv Bolesni je, bolesni sam. Pa reći je, vada da igraš. Ja kaže, igraću. Kaže, vidiš kakav si kao mamut jeven, samo da bi me ne pravio. Ja već vodim, znaš ono. Bio mi Mike Smith neki, nešto kaj, dva brata, koji su nas platili, ali oni su ih oduvali poslije već na pola lige. Mi smo se izvojili. Ja igram i dam, kaka nared, 52 ili 56 pojena.”

Or this second chunk:

“Kako je Vučić zapravo, sad kad se držimo od toga, bavi se meni tako čini, da on koristi jako dobro popularnost i košarke, odnosno Partizana, za svoju nekakvu, ajmo reći, političku kampanju, da pomaže, da koristi to što je na utakmicama Partizana svakog belgijskog direktora, je li tako, Karen, dva ciljača ljudi, ruše se rekordi. Je li on to planski radi ili je stvarno ljubitelj sporta? Znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači, znači U svakom smislu što kažem, mi smo imali pre toga Final 4, sve to ima neku, da kažem, kroniku i prošlost.”

And one more:

“Sada je Zagreb, ali je i Hrvatska. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba.”

One thing that you could try is to use the GPT-4 API to ask for corrections. For example, I just tried this prompt:

See this text:


    “Sada je Zagreb, ali je i Hrvatska. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba. Zagreb je zbog Zagreba.”

It was generated by an AI and I don’t know its contents. I experienced that there might be duplicate words or repeated sentences. Reproduce this text but remove all duplicates/repetitions.

=>

The assistant replied with the text with all the repetitions removed.


I am already doing this kind of improvement for all texts through GPT-3.5, though, because of the API price, and it does manage to improve the text sometimes.

However, the text I posted here is much longer. Instead of a correct transcription, it just repeats the same words and phrases. It is not that only one sentence is affected - whole stretches of sentences are replaced by repeated babbling.

I am still trying to solve this problem and have managed to do so only partly.

  1. I check whether there are more than X repeated words in the text and, if so, send the audio for re-transcription up to 3 more times (a sketch of this retry loop follows this list). Sometimes this solves the problem.

  2. Sometimes the re-transcription is still bad, so my guess is that there is something in the file that triggers this Whisper behaviour and makes it repeat the same mistake every time.

  3. However, as I have also added the possibility of manual re-transcription, if I run it again later, in around half of the occurrences it returns a correct transcription - no repeated words.
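For completeness, the retry step is roughly this - a minimal sketch assuming the openai Python SDK (v1.x) and the repetition detector sketched earlier; the file path, language code and retry count are placeholders:

    # Minimal sketch of the re-transcription retry loop (openai SDK v1.x assumed).
    # find_repetitions() is the hypothetical detector sketched earlier.
    from openai import OpenAI

    client = OpenAI()

    def transcribe_with_retry(path: str, language: str = "hr", retries: int = 3) -> str:
        text = ""
        for attempt in range(1 + retries):
            with open(path, "rb") as f:
                result = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    language=language,
                    temperature=0,  # let Whisper raise the temperature itself if needed
                )
            text = result.text
            if not find_repetitions(text):  # no suspicious repetition -> accept
                return text
        return text  # still repetitive after all retries; flag for manual review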

Is it possible that repeated words are more common when the server is overloaded? Would that make sense?

Is anyone else having the same problem, and how are you handling it?

I am running Whisper locally and have never experienced hallucinations aside from cases where there were microphone issues.
Am I just lucky, or is there something API-related causing this behavior?

What happens if you transcribe the same audio twice? Does the hallucination occur persistently? If not, could you feed several transcripts into the prompt for correction, or alternatively ask GPT-3.5 to select the better one?


I use the API and have never run into issues either, even with Spanish. I have been running it on my phone using Silero VAD and only experience hallucinations when a single word or two is accidentally caught.

Strange. Whisper actively tries to prevent this exact issue by using beam search and a dynamic temperature setting (if you have set it to 0). Whisper has a ~13% error rate with Croatian.

So, three questions:

  1. Are you using a prompt to prime the transcription process?
  2. What is your temperature setting?
  3. How are you sourcing the audio? You said it’s live - how are you capturing it?

@vb As said above, running the same transcription several times will sometimes solve the problem, other times not.

@RonaldGRuckus What do you mean by “running it using Silero VAD”? Are you using Silero to cut silences, or for some other purpose?

However, reading your answer, dynamic temperature might be key here. I was always setting it to 0.1, thinking that would make the transcription more accurate, word by word.

The prompt I am using is basically very simple - I am not sure I even need it: “Precisely transcribe file, keep all the words as they are. Don’t leave anything out.” The language used to be part of the prompt, but now I am sending it to the API as a parameter (ISO code).

I have three ways of capturing the audio: 1) uploading a file via the web interface - in that case it is a file uploaded by the user; 2) capturing the audio via a browser extension - the user starts recording and stops it to send it for transcription; 3) recording in a mobile app, like a kind of modern dictaphone, and sending the audio to Whisper when recording finishes.

In each of these use cases I pre-process the audio: reduce the quality, cut it into chunks when needed, etc.
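The pre-processing itself is nothing fancy - roughly along these lines (a simplified sketch using pydub; the 16 kHz mono downmix and the chunk length are illustrative values, not my exact settings):

    # Simplified pre-processing sketch using pydub (requires ffmpeg installed).
    # 16 kHz mono and 10-minute chunks are illustrative, not production settings.
    from pydub import AudioSegment

    def preprocess(path: str, chunk_ms: int = 10 * 60 * 1000) -> list[str]:
        audio = AudioSegment.from_file(path)
        audio = audio.set_frame_rate(16000).set_channels(1)  # reduce quality/size
        chunk_paths = []
        for i, start in enumerate(range(0, len(audio), chunk_ms)):
            out = f"{path}.chunk{i}.mp3"
            audio[start:start + chunk_ms].export(out, format="mp3")
            chunk_paths.append(out)
        return chunk_paths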

I have had one hilarious hallucination: in a Russian WhatsApp voice message that I ran through the model, it added something like
“- transcribed by xxx yyy” at the end (instead of xxx yyy it was the real name of someone who apparently does transcriptions).
I have nothing else to add, though.

Ok. What happens if you take the audio behind a problematic transcript, cut it right before the problematic sentence, and then transcribe the problematic part separately?

If this helps, there may be an angle with semi-smart chunking.

I use Whisper so I can talk to GPT on my phone in my car or while working. I just use Silero VAD to capture the moments when I’m speaking.

That may be a part of the problem. Try 0 or 1.

If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

I may be wrong here, but the prompt is meant to provide style & context to the transcription, kind of like a “primer”. I don’t believe Whisper follows instructions. You may want to try something like an introduction in Croatian as the prompt. If the audio is casual, it can be something like “Hey how’s it going everybody” (but in your language).

Since it wasn’t trained using instruction-following techniques, Whisper operates more like a base GPT model.

https://platform.openai.com/docs/guides/speech-to-text/prompting

Personally, I would go with number 2 or 3 so that you can control the file format and audio type. Using Silero VAD with a 1-second circular buffer window has worked really well for me. But again, I am continuously recording audio and waiting for the VAD to trigger a speech segment. There may be a better option, though - I haven’t tried other VADs (I have been meaning to try out WebRTC).
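If it helps, loading Silero VAD looks roughly like this - a sketch for offline file processing rather than my live circular-buffer setup; the threshold and file name are placeholders:

    # Rough sketch of running Silero VAD over a recorded file via torch.hub.
    # A live setup would feed a rolling buffer instead of a whole file.
    import torch

    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    (get_speech_timestamps, _, read_audio, *_) = utils

    wav = read_audio("recording.wav", sampling_rate=16000)  # placeholder file
    speech = get_speech_timestamps(wav, model, sampling_rate=16000, threshold=0.5)
    print(speech)  # list of {'start': ..., 'end': ...} sample ranges containing speech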


These are great ideas. Thanks a lot, Ronald.

I will start with the temperature first. Then I will check whether the prompt makes a difference. I have a bunch of these files that were transcribed with repeated words, so there is definitely enough testing material.

I haven’t tried Silero VAD yet. I don’t think these repeats are related to silence, but it could probably help with some other things.


I mean… maybe it could lead to insanity? Not sure. But LLMs can also enter this situation, usually as a result of “greedy decoding”.

But again, that’s why Whisper uses Beam Search and why a temperature of 0 can help prevent that issue (slightly counter-intuitive).

Keep in mind that Whisper uses a timestamp-based sliding context window as well.

Whisper relies on accurate prediction of the timestamp tokens to determine the amount to shift the model’s 30-second audio context window by, and inaccurate transcription in one window may negatively impact transcription in the subsequent windows. We have developed a set of heuristics that help avoid failure cases of long-form transcription, which is applied in the results reported in sections 3.8 and 3.9. First, we use beam search with 5 beams using the log probability as the score function, to reduce repetition looping which happens more frequently in greedy decoding. We start with temperature 0, i.e. always selecting the tokens with the highest probability, and increase the temperature by 0.2 up to 1.0 when either the average log probability over the generated tokens is lower than −1 or the generated text has a gzip compression rate higher than 2.4. Providing the transcribed text from the preceding window as previous-text conditioning when the applied temperature is below 0.5 further improves the performance.
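To make that heuristic concrete, here is a rough sketch of the fallback rule - the thresholds come from the quote above, but the average log probability is a decoder-internal value that the hosted API does not expose, so this is purely illustrative:

    # Illustrative sketch of the fallback rule quoted above. avg_logprob would
    # come from the decoder; the gzip ratio can be computed from the text alone.
    import gzip

    def gzip_compression_ratio(text: str) -> float:
        data = text.encode("utf-8")
        return len(data) / len(gzip.compress(data))

    def needs_higher_temperature(text: str, avg_logprob: float) -> bool:
        # Retry at temperature +0.2 (up to 1.0) when either the average log
        # probability is below -1 or the gzip compression ratio exceeds 2.4.
        return avg_logprob < -1.0 or gzip_compression_ratio(text) > 2.4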


I have changed the temperature value from 0.2 to 0 (which I now understand is just a starting point from which the model finds the best temperature for that chunk) and have tested it with 20 chunks in Croatian that previously contained repeated words.

So, if anyone is having the same problem, I believe this is the right solution. Thanks again, @RonaldGRuckus.


Hey @nikola1jankovic, thanks for this post and the contributions. I am currently using Whisper and facing similar issues. Specifically, some of my transcriptions are affected by weird word or sentence repetitions, while transcriptions of empty audio content (both background noise and no audio at all) all show placeholder text like “Thank you” or just “you”. As I am working in a real-time setup, re-processing those audio clips is not an option. My first idea was to simply process the produced text with a static text analyser like language-tool-python, but I am afraid it is not trivial to tell hallucinations and honest transcription mistakes apart.
How did the hallucinations change for you after setting the temperature to 0? Was there any benefit? Thank you!

Use a Voice Activity Detector to prevent hallucinations from silence. You will need to implement a buffer and have some sort of delay.

A temperature setting of 0 results in a dynamic temperature and not greedy decoding.

Temperature
If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.


Voice Activity Detector

Do you mean at record time, saving everything as a separate clip? Or in pre-processing, cutting out the silent parts and then stitching them back together? Or some other approach?