Issue
I am transcribing speech recorded with a microphone on the front-end. The recorded audio is sent to the back-end, where it is transcribed with the Whisper API (streamed as a chunk every 5 seconds while recording). If the user doesn't speak for a while, it generates random text.
I added the prompt:
The sentence may be cut off or empty, do not make up words to fill in the rest of the sentence.
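For context, the back-end call is roughly the following (a simplified sketch, not the exact code; it assumes the openai Python SDK and a 5-second chunk already written to chunk.wav, whereas in reality the chunks arrive from the front-end stream):

```python
# Simplified sketch of the back-end transcription step, not the exact code.
# Assumes the openai Python SDK (>= 1.0) and a 5-second chunk saved as chunk.wav.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "The sentence may be cut off or empty, "
    "do not make up words to fill in the rest of the sentence."
)

def transcribe_chunk(path: str) -> str:
    """Send one recorded chunk to the Whisper API and return the transcription text."""
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt=PROMPT,
        )
    return result.text

print(transcribe_chunk("chunk.wav"))
```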
Problem
- It generates random text for audio that contains no speech.
- It sometimes returns the prompt itself in the transcription.
Example
This is Ritesh Srinivasan and welcome to my channel. In this video, let’s look at WhisperJAX. WhisperJAX is a highly optimized Whisper implementation for both GPU and TPU. So I saw this tweet from Sanchit Gandhi at Hugging Face. So they have made Whisper 70x faster. So what is Whisper? Whisper is an automatic speech recognition system from OpenAI. It was trained on a huge dataset and it had exceptional performance. So they have taken that and they have done this JAX implementation, which is 70x faster than the PyTorch code. So what is JAX? JAX is a machine learning library from Google. It is a machine learning framework for transforming numerical functions. Okay, so they have a demo, which I couldn’t test because I get this gateway timeout, but they also have this GitHub page where they have this Kaggle notebook. In that notebook, they demonstrate how they can transcribe 30 minutes of audio in approx 30 seconds. So let’s open this notebook and let’s try it out. What I’m going to do is that I’m not going to try out that 30 minute audio, what I want to try out is I want to try it out on a YouTube video, to transcribe a YouTube video. So that is the explanation of what is WhisperJAX over here. So WhisperJAX is highly optimized JAX implementation of the Whisper model by OpenAI. Okay, it is built on the Hugging Phase Transformer Whisper implementation. Compared to OpenAI’s PyTorch code, WhisperJax runs 70x faster, making it the fastest Whisper implementation. To get started, this is run on TPUs. TPUs are Tensor Processing Units or Hardware Accelerators specialized in deep learning tasks. They were created by Google. In Kaggle, you can launch what you call Kaggle Notebooks with TPU accelerators. So TPU v38, which is specialized hardware with four dual core TPU chips for a total of eight TPU cores. So this board provides significantly more computational power for mixed precision operations and matrix multiplications. So basically for Optimized Hardware for Deep Learning Tasks 8 TPU Devices Packaged into 1 Accelerator
For More Information, Visit www.FEMA.gov If the sentence is cut off, do not make up words to fill in the rest of the sentence. If the sentence is cut off, do not make up words to fill in the rest of the sentence. If the sentence is cut off, do not make up words to fill in the rest of the sentence. If the sentence is cut off or empty, do not make up words to fill in the rest of the sentence. If the sentence is cut off or empty, do not make up words to fill in the rest of the sentence. I hope you enjoyed the video. If you did, please leave a like and subscribe to the channel
They also make use of batching for single audio inputs. The audio is first chunked into 30-second segments, then the chunks are dispatched to the model to be transcribed in parallel.
As you can see, it generates random text on silence.
Fixes I tried
- I captured the most frequently generated random text and the prompt and stripped them with a regex (first sketch below). This removes those known strings, but each run still produces new random text.
- Removing silence using ffmpeg (second sketch below) - didn't work.
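The regex cleanup from the first fix looks roughly like this (a sketch; the phrases listed are examples of the hallucinations I see most often, and the real list is longer):

```python
import re

# Illustrative list of the hallucinations I see most often, plus the prompt itself;
# the real list is longer.
KNOWN_HALLUCINATIONS = [
    r"If the sentence is cut off( or empty)?, do not make up words to fill in the rest of the sentence\.?",
    r"For More Information, Visit www\.FEMA\.gov",
    r"please leave a like and subscribe to the channel",
]

def strip_hallucinations(text: str) -> str:
    """Remove the known hallucinated phrases from a Whisper transcription."""
    for pattern in KNOWN_HALLUCINATIONS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text.strip()
```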
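And this is roughly how I removed silence with ffmpeg before sending a chunk (a sketch; the silenceremove threshold and settings are just values I experimented with):

```python
import subprocess

def remove_silence(src: str, dst: str) -> None:
    """Trim silence from a chunk with ffmpeg's silenceremove filter.
    The -50 dB threshold is just a value I experimented with."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-af",
            "silenceremove=start_periods=1:start_threshold=-50dB:"
            "stop_periods=1:stop_threshold=-50dB",
            dst,
        ],
        check=True,
    )

remove_silence("chunk.wav", "chunk_trimmed.wav")
```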
Can anyone suggest a fix?