Realtime semantic VAD not working

Really excited about the Semantic turn detection feature, but I’ve been experimenting with it in the playground and it doesn’t seem to behave any differently from the normal server VAD. I’ve had the same feedback from my colleagues.

Even with eagerness set to low, and after playing around with different noise reduction options, it just never waits for me to finish speaking (I tried the suggested “ummmm”, and pausing mid-sentence). Is anyone else experiencing the same issue? Does it need some extra prompting?

I’m speaking English to the model and have tried different voices and a combination of settings.


Hey Thomas — I ran into the same quirk while testing non-semantic turn detection. I have not tried that model, as I use offline Whisper.

Whisper’s eager behavior with WebRTC VAD can jump the gun when there’s even a slight pause, especially if you’re thinking mid-sentence or using filler like “umm.”

:wrench: While I do not have this fully implemented in my system, I do have a concept that works well:

In my system (Kruel.ai), I built a solution around a dual-buffer, chained-input model:

  1. Voice input goes into a live buffer, but it doesn’t get immediately processed the moment the user stops speaking.
  2. If the user resumes speaking within a chain window (around 1.5 seconds), it re-attaches the new input to the original, treating it as part of the same turn.
  3. Only if that window expires without more input does it finalize the transcription and pass it on.
  4. If a response has already started speaking and new input comes in, we interrupt playback, rebuild the message with the new input, and regenerate the reply.

Basically, it treats you like a human who sometimes pauses mid-thought — instead of a machine firing off a reply the second you breathe.
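For anyone who wants to experiment with this, here is a minimal Python sketch of the chain-window idea. The `TurnBuffer` name and the injectable clock are my own choices for illustration; the 1.5 s window value comes from the steps above.

```python
import time

CHAIN_WINDOW_S = 1.5  # pause length that still counts as the same turn

class TurnBuffer:
    """Dual-buffer turn aggregator: speech arriving within the chain
    window is re-attached to the open turn instead of starting a new one."""

    def __init__(self, chain_window=CHAIN_WINDOW_S, clock=time.monotonic):
        self.chain_window = chain_window
        self.clock = clock          # injectable so the logic is testable
        self.chunks = []            # live buffer for the current turn
        self.last_voice_at = None   # when the user last stopped speaking

    def on_speech(self, chunk):
        """User spoke: merge into the open turn if within the window."""
        now = self.clock()
        if self.last_voice_at is not None and now - self.last_voice_at > self.chain_window:
            self.chunks = []        # window expired: previous turn is done
        self.chunks.append(chunk)
        self.last_voice_at = now

    def try_finalize(self):
        """Return the full turn once the chain window has expired, else None."""
        if not self.chunks or self.last_voice_at is None:
            return None
        if self.clock() - self.last_voice_at < self.chain_window:
            return None             # user may still resume: keep waiting
        turn, self.chunks = " ".join(self.chunks), []
        return turn
```

Calling `try_finalize()` on a timer gives the "only if that window expires" behavior: a pause shorter than the window simply chains the next chunk onto the same turn.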

:test_tube: Results:

With that setup, “ummm” and hesitation gaps are handled way more gracefully. It even lets you correct yourself or extend the input stream, and the system responds like you’re just continuing the same sentence.

I also process each input in a separate thread, which helps when you have multiple inputs at once; it speeds up that reprocessing step and minimizes the time to respond as much as possible.
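A rough sketch of that per-input threading, assuming a `generate_reply` callable standing in for the actual STT/LLM call (the supersede-by-newer-turn check is my reading of the "rebuild and regenerate" step):

```python
from concurrent.futures import ThreadPoolExecutor

class ResponsePipeline:
    """Each finalized turn is processed on a worker thread; a newer turn
    supersedes any reply that is still being generated."""

    def __init__(self, generate_reply):
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.generate_reply = generate_reply  # stand-in for the real model call
        self.current = 0                      # id of the most recent turn

    def submit(self, turn_text):
        self.current += 1
        return self.pool.submit(self._run, self.current, turn_text)

    def _run(self, job_id, turn_text):
        reply = self.generate_reply(turn_text)
        if job_id != self.current:
            return None                       # superseded: discard stale reply
        return reply
```

Only the newest turn's reply survives, so a user who keeps talking never hears an answer to a half-finished sentence.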

Hope that helps you and others with a solution to make all input models work better.

I am sure there are even better ways; perhaps when I swing back to my STT module I will come up with more ideas, but this concept has worked for me for many years.

Another thing you can do, if you have user profiles so that each user’s inputs are tracked, is build an ML model that learns a person’s speech patterns over time. That lets the system auto-adjust the parameters if it knows you are a pause-thinker, etc., so the model adapts on the fly. It has a hardware cost for processing, but it is another way to tune things.
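You don't necessarily need a heavy ML model to get most of that benefit. As a lightweight stand-in, a per-user exponential moving average of observed pause lengths can drive the window; the class name, margin, and bounds below are all assumed values for illustration:

```python
class AdaptivePauseModel:
    """Per-user pause tuning: tracks an exponential moving average of how
    long this user pauses before resuming, and sets the chain window a
    safety margin above it. A simple stand-in for a learned model."""

    def __init__(self, initial_s=1.5, margin=1.3, alpha=0.2,
                 floor_s=0.6, ceil_s=3.0):
        self.avg_pause = initial_s / margin  # start so the window == initial_s
        self.margin, self.alpha = margin, alpha
        self.floor_s, self.ceil_s = floor_s, ceil_s

    def observe_resume(self, pause_s):
        """Call when the user resumed speaking after pausing pause_s seconds."""
        self.avg_pause += self.alpha * (pause_s - self.avg_pause)

    @property
    def chain_window(self):
        """Current window: margin above the user's typical pause, clamped."""
        return min(self.ceil_s, max(self.floor_s, self.avg_pause * self.margin))
```

A habitual pause-thinker drags the average up and gets a longer window; a rapid-fire speaker gets snappier responses, all without any offline training.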

You can also create a pause detector where “hmm”, “umm”, etc. are picked up, which could simply extend the timer threshold.
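That filler-word extension is only a few lines if your STT produces partial transcripts. The filler list and the bonus value below are assumptions; tune both for your users:

```python
FILLERS = {"um", "umm", "ummm", "uh", "hmm", "er"}  # assumed filler list

def pause_extension(partial_transcript, base_window=1.5, bonus=1.0):
    """If the partial transcript ends in a filler word, extend the silence
    timeout: the user is almost certainly still mid-thought."""
    words = partial_transcript.lower().rstrip(".,!? ").split()
    if words and words[-1].strip(".,") in FILLERS:
        return base_window + bonus
    return base_window
```

The detector never shortens the window, so the worst case of a false positive is a slightly slower reply, not a clipped turn.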

Cheers.



Do you use energy-based detection to tell if the user has stopped speaking?
I am using silero-vad, appending audio to a sliding window of up to 2 seconds and repeatedly checking whether the user has stopped talking. It’s quite CPU-consuming.

| Feature | Kruel.ai STT | silero-vad approach |
| --- | --- | --- |
| VAD engine | WebRTC (frame-based) | Silero (window-based) |
| Buffering | On-the-fly (20 ms chunks) | Sliding window (1–2 s) |
| STT triggering | Post-silence + VAD | Silence-detected window flush |
| CPU load | Very low | Moderate to high |
| Flexibility | Fully decoupled from STT | Tight integration |

There is a lot of other magic you can do, like taking background noise measurements while no VAD is active and using that to set a baseline. Or you can use ML, which is heavier for the first little while, but it can use the detected signals to find a sweet spot for you. Many, many ways to play around.
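The noise-floor baseline trick is cheap to try. A pure-Python sketch (the class name, the 3× speech-over-floor ratio, and the smoothing factor are all assumptions to tune):

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (sequence of PCM samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class NoiseFloor:
    """Tracks background energy while nothing is being detected as speech,
    and uses that measured floor as the baseline for speech detection."""

    def __init__(self, ratio=3.0, alpha=0.05):
        self.floor = None
        self.ratio = ratio    # speech must exceed floor * ratio (assumed)
        self.alpha = alpha    # how quickly the baseline adapts

    def update(self, frame):
        """Feed every frame; quiet frames slowly refine the baseline."""
        e = rms(frame)
        if self.floor is None:
            self.floor = e
        elif e < self.floor * self.ratio:          # looks like background
            self.floor += self.alpha * (e - self.floor)
        return e

    def is_speech(self, frame):
        return self.floor is not None and rms(frame) > self.floor * self.ratio
```

Because loud frames are excluded from the floor update, the baseline tracks the room, not the speaker, and the detector adapts when someone moves from a quiet office to a noisy café.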


Silero VAD’s claimed higher precision confused me. I think that in a clean environment with high SNR, a WebRTC frame-based solution can give good-enough results. I am trunking VOIP calls at only an 8 kHz sample rate and sending 20 ms chunks, so perhaps frame-based works better than window-based VAD here.
For browser-based solutions, the sample rate of audio such as WebM is much higher than VOIP, and a 200 ms VAD window is enough (I tested); there is no need for 1–2 s.


Thanks @shooding — your observation is completely valid, and I agree that Silero VAD offers excellent precision and can outperform WebRTC in lower-SNR or streaming VOIP contexts where extended buffering is acceptable.

In Kruel.ai, we intentionally chose a different design path not because WebRTC is more accurate, but because it fits better with the real-time interaction model we’re optimizing for.


:wrench: Why We Use WebRTC + Whisper in a Hybrid Design

We needed a pipeline that could:

  • Work on live audio with minimal latency
  • Run efficiently on systems with limited CPU/GPU headroom
  • Allow for pause chaining and natural correction mid-thought
  • Decouple voice activity detection from transcription so we could manipulate behavior per user/session context

So we use WebRTC VAD only to trigger Whisper after silence — not to gate every frame. We’ve added logic to:

  • Delay transcription if speech is likely to resume (~1.5s window)
  • Merge back-to-back utterances into a single turn
  • Interrupt playback and restart inference if the user resumes mid-reply

This gives a much more conversational feel, and lets users pause, think, “umm,” or even self-correct without breaking the flow.
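Stripped of the Kruel.ai specifics, the "VAD only decides when to listen" loop looks like this. Here `is_speech` and `transcribe` are stand-ins for the real webrtcvad check and the Whisper call, and the 75-frame threshold is just 1.5 s of 20 ms frames:

```python
FRAME_MS = 20          # WebRTC VAD operates on 10/20/30 ms frames
SILENCE_FRAMES = 75    # 75 * 20 ms = 1.5 s chain window (assumed)

def vad_gate(frames, is_speech, transcribe):
    """VAD only decides *when* to listen: buffer speech frames and call
    the (expensive) transcriber once silence has lasted the full window."""
    buffered, silent, turns = [], 0, []
    for frame in frames:
        if is_speech(frame):
            buffered.append(frame)
            silent = 0                         # any speech re-arms the window
        elif buffered:
            silent += 1
            if silent >= SILENCE_FRAMES:       # window expired: flush the turn
                turns.append(transcribe(buffered))
                buffered, silent = [], 0
    if buffered:                               # end of stream: flush remainder
        turns.append(transcribe(buffered))
    return turns
```

Because the transcriber only runs once per turn rather than per frame, short pauses and "umm"s land inside one buffered turn, and the expensive model stays idle while the VAD does the cheap gating.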

You’re right that in terms of raw VAD accuracy, especially under acoustic stress, Silero wins. The chart showing Silero’s precision-recall curve absolutely reflects that.

But in our case, we let Whisper handle accuracy and let VAD just handle when to listen. That tradeoff gives us a better UX for on-device, responsive agents — even if it’s not technically the most precise VAD layer.

For reference, we are also using the offline Whisper base model.