I’ve had it set at 0.1 mostly. I’ve been transcribing Japanese audio, and have also tried 0.3 temp to see if it made any difference.
But the transcription quality is actually very stable and solid. There was an issue with some names being ‘shmushed together’ with other words because they sounded similar to an existing word that also made sense in the context of the sentence.
But adding the missing word to the system prompt immediately fixed this. Also having a few Katakana words in the prompt seems to completely fix issues with ‘borrowed’ English words in the transcription.
I’m not sure how to reliably replicate the issue with repeating characters. I used to sometimes have a few repetitions of “Thanks for watching!” at the end, but now I’ve been seeing a character repetition that basically seems to max out the response size.
If I have a clip that results in such an issue, I consistently get the same result when trying to send that same audio file with different parameters.
The only thing I’ve been able to ascertain so far is that these clips always contain some “uhh” or “ahh” sounds. My prompt contains some examples of how to transcribe those sounds and it generally works perfectly.
But sometimes the transcription spirals out into repetition right around the “ahh” or “hmm” sound.
It could also maybe be a combination of “hmm” or “ahh” with background music at a certain frequency/volume or something.
Currently I just use regex to detect and remove excessively repeating characters.
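For reference, the cleanup step is roughly along these lines (a rough sketch rather than my exact pattern; the function name and the repetition threshold are arbitrary choices):

import re

def strip_repetitions(text, max_repeats=4):
    # Collapse any single character repeated more than max_repeats times in a row
    text = re.sub(r"(.)\1{%d,}" % max_repeats, r"\1", text)
    # Collapse short 2-10 character sequences repeated back to back
    text = re.sub(r"(.{2,10})\1{%d,}" % max_repeats, r"\1", text)
    return text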
I assume that infrequent quirks like this will keep popping up now and then. The behavior of the model probably changes in subtle ways.
Oh, and yeah, I did try putting an actual instruction in the Whisper prompt, but as expected that had no effect.
Ah yes, I used 0 temp in the beginning. But I got frustrated when trying to understand exactly how it worked, and generally did not like the fact that I could not see which temperature was actually used in each response when it was adjusted dynamically.
Though I only used 0 temp very briefly, so I haven’t yet observed if it affects the frequency of character repetitions
A separate issue was that (after a transcription is sent to gpt4 for translation) the response from gpt4 also included all those character repetitions, despite the fact that the system prompt contains an instruction to omit repetitions. What I think may be happening is that the system message gets ‘pushed out’ of the request when the token limit is reached.
I haven’t confirmed this, because it only occurred to me after it happened. The request as a whole has to be under the token limit, but close enough to it that the transcript ‘takes up the space’ of the system prompt.
Now I wonder if moving the system prompt to the bottom of the messages array in the request would change this behavior.
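If the token limit really is the cause, another workaround would be to trim the transcript itself before building the request, so the system message always fits. A rough sketch using tiktoken (the budget numbers here are placeholders, not recommendations):

import tiktoken

def build_messages(system_prompt, transcript, max_request_tokens=8192, reply_budget=1000):
    enc = tiktoken.get_encoding("cl100k_base")
    # Reserve space for the system prompt and the model's reply,
    # then cut the transcript down to whatever budget remains
    budget = max_request_tokens - len(enc.encode(system_prompt)) - reply_budget
    truncated = enc.decode(enc.encode(transcript)[:budget])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": truncated},
    ]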
edit:
I tried your suggestion of 0 temp and it actually works. I first obtained a short sound clip that consistently resulted in a long string of repeated characters. The same issue occurred with 0.1 on about 10 retries.
With 0 temp with the same clip, the result is correct every time.
For what it’s worth, and based on others’ suggestions, adding a prompt with these instructions: “The audio provided may have moments of silence, do not make up words to fill in extended times of silence.”
did seem to work for me. I tested it on an almost 2-hour video file, segmented with pydub.
I needed actual VTT files with accurate timestamps and was getting repeats whenever the speaker paused for a second or more. Not the most graceful solution, but it seems to work.
UPDATE: It’s hit and miss. Shortly after posting this and testing another file, I ended up with a file that repeated the prompt above 34 times when there wasn’t even a silence.
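For context, the segmentation side of it looks roughly like this (the 30-second chunk length and file names are just placeholders, not a recommendation for VTT alignment):

from pydub import AudioSegment

def split_audio(file_path, chunk_ms=30_000):
    audio = AudioSegment.from_file(file_path)
    chunks = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        out_path = f"chunk_{i:04d}.mp3"
        # Export each slice and keep its start offset so timestamps can be shifted back later
        audio[start:start + chunk_ms].export(out_path, format="mp3")
        chunks.append((out_path, start))
    return chunks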
A temperature of 0 usually means greedy decoding, but in this case it actually enables a dynamic temperature:
temperature (number, Optional, Defaults to 0)
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
If you want to include utterances, you should use a prompt that demonstrates this. For example, you could use something like this as your prompt:
“This is… Uhmm… A live transcript. Maybe… Maybe it will start like this”
The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them: "Umm, let me think like, hmm… Okay, here’s what I’m, like, thinking."
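For anyone following along, passing such a prompt through the API looks roughly like this (using the 1.x openai Python client; the file name and the explicit temperature are just examples):

from openai import OpenAI

client = OpenAI()
with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Umm, let me think like, hmm… Okay, here's what I'm, like, thinking.",
        temperature=0,  # 0 lets the API raise the temperature dynamically, per the docs quoted above
    )
print(transcript.text)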
Can you please provide the prompt you are working with? In my experience the prompt has always worked to include these words. Is your prompt by any chance instructional?
transcriptionPrompt = "Hi folks. I'm going to umm... read aloud from this book. I apologize if I sound a bit uhh... a bit stuffy or congested. I haven't gotten over this cold yet. I might clear my throat... *ahem* from time to time too."
Prompt looks great. So I tried an audio recording with and without your prompt. I ran the Whisper medium model locally using the whisper-timestamped library in Python.
def transcribe(language, file_path):
    import whisper_timestamped as wt

    audio = wt.load_audio(file_path)
    model = wt.load_model("medium", device="cuda")
    result = wt.transcribe(
        model,
        audio,
        language=language,
        initial_prompt="Hi folks. I'm going to umm... read aloud from this book. I apologize if I sound a bit uhh... a bit stuffy or congested. I haven't gotten over this cold yet. I might clear my throat... *ahem* from time to time too.",
    )
    return result
WITHOUT your prompt:
{‘text’: " Hi, so maybe we could do something like this or maybe not we’ll have to find out. Well maybe we’ll have to find out. We’ll find out. That’s how it’s gonna go. Bop, beep, boop, 20 seconds is coming up."}
WITH your prompt:
{‘text’: " Hi. So umm… maybe… maybe we could do something like this. Or maybe not. We’ll have to find out. Well maybe… maybe we’ll have to find out. We’ll find out. That’s how it’s gonna go. Bop. Beep. Boop. 20 seconds. It’s coming up."}
Could you show your code where you’re calling the whisper endpoint?
The recordings that I saved as examples which produced repeated characters or cut-off transcriptions are now being transcribed correctly, even though I’m using the same settings.
This issue does still happen though, especially if a recording starts with a long “ooh” or “ahh” sound. Basically any vocalization that is not a word. I set up my app to switch between parameters quickly so I can always compare results.
I don’t consider this much of a problem anymore, since I can just use regex to remove repeated characters before translation, and can avoid the issue entirely by making sure I don’t start a recording with any non-word vocalizations.
I just thought it was interesting that the recordings I had saved which consistently used to produce errors regardless of parameter settings are now transcribing correctly.