You can line up the timestamps and cut out the moments when the VAD is off, in post-processing.
For live transcripts you’d need a buffer that records only while the VAD is active.
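If it helps, here is a minimal sketch of that kind of VAD-gated buffer using the webrtcvad package. The frame source incoming_frames is hypothetical; webrtcvad itself expects 16-bit mono PCM in 10, 20, or 30 ms frames at 8/16/32/48 kHz.

import webrtcvad

# Sketch of a VAD-gated buffer: frames are kept only while speech is
# detected, so silent stretches never reach the transcriber.
# Assumes 16 kHz, 16-bit mono PCM audio in 30 ms frames.
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a middle ground

def gate_frames(frames):
    """Yield only the frames that contain speech."""
    for frame in frames:
        if len(frame) == FRAME_BYTES and vad.is_speech(frame, SAMPLE_RATE):
            yield frame

# incoming_frames is a hypothetical iterable of raw PCM frames:
# speech_audio = b"".join(gate_frames(incoming_frames))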
Thanks @RonaldGRuckus for your input. At the moment, I am using VAD to exclude audio clips where there is no voice activity at all. Because I am working in a live environment, those clips can simply be skipped, so the hallucination issue doesn’t get that bad. What really improved things for me was using prompts: I currently pass in the transcript of the previous 30 seconds of audio.
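For reference, here is a minimal sketch of that rolling-prompt approach with the openai Python SDK. The chunking loop and chunk_paths are assumptions for illustration; only the prompt parameter itself is part of the API.

from openai import OpenAI

client = OpenAI()
previous_text = ""  # transcript of the last ~30 s window

# chunk_paths is a hypothetical list of ~30 s audio files
for path in chunk_paths:
    with open(path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt=previous_text,  # condition the model on the prior window
        )
    previous_text = transcription.text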
Curiously enough, it seems like hallucinations can be triggered by a sort of prompt injection whenever the same word is repeated in the audio over and over again. For instance, 4 or 5 NOs in the input audio will result in tons of NOs in the output transcript.
In my experience, this is usually caused by poor-quality microphone input.
It is still happening as of 08/May/2024. Has anyone solved this issue? Should I remove all the silence from the mp3 file?
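For anyone who wants to try removing the silence first, here is a minimal sketch using pydub (which requires ffmpeg). The threshold and duration values are guesses you would need to tune, not recommendations from this thread.

from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_mp3("audio.mp3")

# Split on stretches quieter than -40 dBFS lasting at least 500 ms;
# keep 100 ms of padding so word edges aren't clipped. Tune for your input.
chunks = split_on_silence(
    audio,
    min_silence_len=500,
    silence_thresh=-40,
    keep_silence=100,
)

trimmed = AudioSegment.empty()
for chunk in chunks:
    trimmed += chunk
trimmed.export("audio_trimmed.mp3", format="mp3")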
My apologies, I realize I was addressing a different issue. I was experiencing hallucinations during silence. I am going to check the ‘no_speech_prob’ attribute, and I believe it will help me.
Can someone please explain how the ‘no_speech_prob’ attribute is incorporated into code like this:
from openai import OpenAI

client = OpenAI()
audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
)
print(transcription.text)
Thanks…
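In case it helps: no_speech_prob is a per-segment field, so you need to request the verbose_json response format to see it. A minimal sketch, where the 0.6 cutoff is only an illustrative threshold, not an official recommendation:

from openai import OpenAI

client = OpenAI()

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # exposes per-segment metadata
    )

# Keep only segments the model thinks contain speech; 0.6 is an
# illustrative threshold to tune against your own audio.
kept = [s.text for s in transcription.segments if s.no_speech_prob < 0.6]
print(" ".join(kept).strip())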