Hallucination on audio with no speech

Issue
I am trying to transcribe speech recorded with the microphone on the front end. The audio is sent to the back end, where it is transcribed using the Whisper API (streamed in 5-second chunks while recording). If the user doesn’t speak for a while, it generates random text.

I added this prompt:

The sentence may be cut off or empty, do not make up words to fill in the rest of the sentence.

Problems

  • It generates random text on audio where there is no speech.
  • It echoes the prompt back in the transcript.

Example

This is Ritesh Srinivasan and welcome to my channel. In this video, let’s look at WhisperJAX. WhisperJAX is a highly optimized Whisper implementation for both GPU and TPU. So I saw this tweet from Sanchit Gandhi at Hugging Face. So they have made Whisper 70x faster. So what is Whisper? Whisper is an automatic speech recognition system from OpenAI. It was trained on a huge dataset and it had exceptional performance. So they have taken that and they have done this JAX implementation, which is 70x faster than the PyTorch code. So what is JAX? JAX is a machine learning library from Google. It is a machine learning framework for transforming numerical functions. Okay, so they have a demo, which I couldn’t test because I get this gateway timeout, but they also have this GitHub page where they have this Kaggle notebook. In that notebook, they demonstrate how they can transcribe 30 minutes of audio in approx 30 seconds. So let’s open this notebook and let’s try it out. What I’m going to do is that I’m not going to try out that 30 minute audio, what I want to try out is I want to try it out on a YouTube video, to transcribe a YouTube video. So that is the explanation of what is WhisperJAX over here. So WhisperJAX is highly optimized JAX implementation of the Whisper model by OpenAI. Okay, it is built on the Hugging Phase Transformer Whisper implementation. Compared to OpenAI’s PyTorch code, WhisperJax runs 70x faster, making it the fastest Whisper implementation. To get started, this is run on TPUs. TPUs are Tensor Processing Units or Hardware Accelerators specialized in deep learning tasks. They were created by Google. In Kaggle, you can launch what you call Kaggle Notebooks with TPU accelerators. So TPU v38, which is specialized hardware with four dual core TPU chips for a total of eight TPU cores. So this board provides significantly more computational power for mixed precision operations and matrix multiplications. 
So basically for Optimized Hardware for Deep Learning Tasks 8 TPU Devices Packaged into 1 Accelerator For More Information, Visit www.FEMA.gov If the sentence is cut off, do not make up words to fill in the rest of the sentence. If the sentence is cut off, do not make up words to fill in the rest of the sentence. If the sentence is cut off, do not make up words to fill in the rest of the sentence. If the sentence is cut off or empty, do not make up words to fill in the rest of the sentence. If the sentence is cut off or empty, do not make up words to fill in the rest of the sentence. I hope you enjoyed the video. If you did, please leave a like and subscribe to the channel They also make use of batching for single audio inputs. The audio is first chunked into 30-second segments, then the chunks are dispatched to the model to be transcribed in parallel.

As you can see, it generates random text on silence, including repeated copies of the prompt.

Fixes I tried

  • I collected the most frequently generated random text and the echoed prompt and removed them with a regex. This works when the same text is generated again, but new random text still appears on each run.
  • Removed silence using ffmpeg. This doesn’t work.
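
For anyone wanting to try the regex approach from the first bullet, here is a minimal sketch of what such a filter could look like. The pattern list and the name `strip_known_hallucinations` are illustrative assumptions, with the phrases taken from the example output above; a real list would have to be collected from your own transcripts. As noted, this only removes text that has been seen before, so it does not stop new random text.

```python
import re

# Illustrative patterns only, copied from the example output in this thread.
HALLUCINATION_PATTERNS = [
    # Echo of the prompt, in the wordings seen in the example output
    r"(?:If )?the sentence (?:is|may be) cut off(?: or empty)?, "
    r"do not make up words to fill in the rest of the sentence\.",
    # A stock phrase that appeared on silence in the example output
    r"For More Information, Visit www\.FEMA\.gov",
]

_CLEANER = re.compile(
    "|".join(f"(?:{p})" for p in HALLUCINATION_PATTERNS),
    re.IGNORECASE,
)


def strip_known_hallucinations(text: str) -> str:
    """Remove known hallucinated phrases, then collapse leftover whitespace."""
    cleaned = _CLEANER.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

If a chunk comes back empty after filtering, it can simply be dropped instead of being appended to the running transcript.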

Please suggest any fixes.

Hi! Check out the forum search and you will find similar topics and solution suggestions like this ‘How to avoid Hallucinations in Whisper transcriptions? - #18 by Jazz’ and this ‘Whisper hallucination - how to recognize and solve? - #17 by nikola1jankovic’.

Try those and any others you find first; you are likely to make at least some good progress.


Why doesn’t removing silence with ffmpeg work? I’m doing it with apparent success. I use this to trim silence at the start and end:

ffmpeg -y -loglevel error -i audio.wav -af 'areverse,silenceremove=start_periods=1:start_duration=0.05:start_silence=0.1:start_threshold=0.02,areverse,silenceremove=start_periods=1:start_duration=0.05:start_silence=0.1:start_threshold=0.02' audio-trim.wav

Btw, most of the hallucinated text I got (before trimming silence) was things like credits for movie subtitle sites. I guess that may be because Whisper was trained on movies, where such text shows up in the subtitles during silent parts. I’m surprised OpenAI doesn’t filter that out, though.


I capture the sound in chunks on my front end, where I check the amplitude and noise level to find out whether someone else is talking now or there is a pause. This way I only have small portions, which I can send to Whisper or, if needed, put together with ffmpeg…
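
The same kind of amplitude check can also be done on the back end before a chunk is sent to Whisper. Below is a rough sketch, assuming the chunk has already been decoded to 16-bit signed little-endian mono PCM. The function names and the -30 dB threshold are assumptions for illustration, not something from the Whisper API, and the threshold would need tuning for your microphone and room noise.

```python
import array
import math
import sys


def rms_dbfs(pcm16: bytes) -> float:
    """RMS level of 16-bit signed mono PCM, in dB relative to full scale."""
    samples = array.array("h")
    samples.frombytes(pcm16[: len(pcm16) // 2 * 2])  # ignore a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # the input is assumed to be little-endian
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / (32768.0 ** 2))


def should_transcribe(pcm16: bytes, threshold_db: float = -30.0) -> bool:
    """True when the chunk is louder than the (assumed) silence threshold."""
    return rms_dbfs(pcm16) > threshold_db
```

A chunk for which `should_transcribe` returns `False` would just be skipped, so Whisper never sees audio with nothing to transcribe. A plain energy gate like this cannot tell speech from loud background noise, so it is only a first filter.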

Here is also a little shell script I made some time ago. I’m not quite sure if it worked, but I guess it is worth a try:

#!/bin/bash

# Iterate over each WebM file starting with "recording_"
for file in recording_*.webm; do
    # Skip the unexpanded pattern when no files match
    [ -e "$file" ] || continue

    # Convert .webm to .wav (16-bit PCM, mono, 44.1 kHz)
    OUTPUT_WAV="${file%.webm}.wav"
    ffmpeg -i "$file" -acodec pcm_s16le -ac 1 -ar 44100 "$OUTPUT_WAV"

    # Detect silence (below -30 dB for at least 0.5 s); ffmpeg logs the results to stderr
    SILENCE_OUTPUT=$(ffmpeg -i "$OUTPUT_WAV" -af silencedetect=n=-30dB:d=0.5 -f null - 2>&1)

    # Print the entire SILENCE_OUTPUT for debugging
    echo "$SILENCE_OUTPUT"

    # End of the first silent stretch and start of the last silent stretch
    FIRST_SILENCE_END=$(echo "$SILENCE_OUTPUT" | grep "silence_end" | awk -F': ' '{print $2}' | awk -F' \\|' '{print $1}' | head -1)
    SECOND_SILENCE_START=$(echo "$SILENCE_OUTPUT" | grep "silence_start" | awk -F': ' '{print $2}' | tail -1)

    # Debug
    echo "Debug: FIRST_SILENCE_END=$FIRST_SILENCE_END"
    echo "Debug: SECOND_SILENCE_START=$SECOND_SILENCE_START"

    # Calculate duration of the non-silent part between them
    if [ -n "$FIRST_SILENCE_END" ] && [ -n "$SECOND_SILENCE_START" ]; then
        DURATION=$(echo "$SECOND_SILENCE_START - $FIRST_SILENCE_END" | bc)
    else
        DURATION=0
    fi

    # Debug
    echo "Debug: DURATION=$DURATION"

    # Decide on how to process file
    if (( $(echo "$DURATION > 0" | bc) )); then
        # Extract non-silent part
        ffmpeg -i "$file" -ss "$FIRST_SILENCE_END" -t "$DURATION" "clean_$file"
    else
        # No usable span (no silence detected, or only one silent stretch,
        # which would give a zero or negative duration): copy the file with clean_ prefix
        cp "$file" "clean_$file"
    fi
done
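
If the back end is in Python, the parsing part of the script could be sketched as follows. It reads the `silence_start` / `silence_end` lines that ffmpeg's `silencedetect` filter logs to stderr and returns the start offset and duration of the span between the first and last silent stretch, which are the values the script passes to `-ss` and `-t`. The function name `speech_span` is made up for illustration, and how the log text is captured (for example by running ffmpeg as a subprocess) is left out.

```python
import re
from typing import Optional, Tuple

# Timestamps may be negative or in scientific notation, so allow both.
_NUMBER = r"(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
_START = re.compile(r"silence_start:\s*" + _NUMBER)
_END = re.compile(r"silence_end:\s*" + _NUMBER)


def speech_span(silencedetect_log: str) -> Optional[Tuple[float, float]]:
    """Return (start, duration) of the audio between the first and last
    silent stretch, or None when no such span can be derived from the log."""
    starts = [float(x) for x in _START.findall(silencedetect_log)]
    ends = [float(x) for x in _END.findall(silencedetect_log)]
    if not starts or not ends:
        return None
    begin, finish = ends[0], starts[-1]
    if finish <= begin:
        # Only one silent stretch was found, so there is no span between two
        return None
    return begin, finish - begin
```

Returning `None` covers the same cases in which the shell script falls back to copying the file unchanged.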

Thanks!

ffmpeg -y -loglevel error -i audio.wav -af 'areverse,silenceremove=start_periods=1:start_duration=0.05:start_silence=0.1:start_threshold=0.02,areverse,silenceremove=start_periods=1:start_duration=0.05:start_silence=0.1:start_threshold=0.02' audio-trim.wav

I will try out ffmpeg with these options.

Thanks for the Script!

Sure, I will give it a try.
