Realtime semantic VAD not working

Really excited about the Semantic turn detection feature, but I’ve been experimenting with it in the playground and it doesn’t seem to behave any differently from the normal server VAD. I’ve had the same feedback from my colleagues.

Even with eagerness set to low, and after playing around with different noise reduction options, it just never waits for me to finish speaking (I tried the suggested “ummmm” as well as just pausing mid-sentence). Anyone else experiencing the same issue? Does it need some extra prompting?

I’m speaking English to the model and have tried different voices and various combinations of settings.
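
For reference, these are roughly the settings I’ve been toggling, expressed as a Realtime `session.update` payload (the field names are my reading of the docs, so double-check them against the current API reference):

```python
# Roughly the Realtime session settings I've been experimenting with.
# Field names are my best understanding of the API; verify against the docs.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",   # instead of "server_vad"
            "eagerness": "low",       # "low" | "medium" | "high" | "auto"
        },
        "input_audio_noise_reduction": {
            "type": "near_field",     # also tried "far_field"
        },
    },
}

# Sent as a JSON event over the Realtime websocket, e.g.:
# ws.send(json.dumps(session_update))
```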


Hey Thomas, I ran into the same quirk while testing non-semantic turn detection. I haven’t tried that model myself, since I use offline Whisper.

Whisper’s eager behavior with WebRTC VAD can jump the gun when there’s even a slight pause, especially if you’re thinking mid-sentence or using filler like “umm.”

:wrench: While I don’t have this fully implemented in my system, I do have a concept that works well:

In my system (Kruel.ai), I built a solution around a dual-buffer, chained-input model:

  1. Voice input goes into a live buffer, but it isn’t processed the moment the user stops speaking.
  2. If the user resumes speaking within a chain window (around 1.5 seconds), it re-attaches the new input to the original, treating it as part of the same turn.
  3. Only if that window expires without more input does it finalize the transcription and pass it on.
  4. If a response has already started speaking and new input comes in, we interrupt playback, rebuild the message with the new input, and regenerate the reply.

Basically, it treats you like a human who sometimes pauses mid-thought — instead of a machine firing off a reply the second you breathe.
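
This isn’t Kruel.ai’s actual code, but here’s a minimal Python sketch of that chain-window idea (the class name and the 1.5 s default are just illustrative):

```python
import threading
import time

class ChainedTurnBuffer:
    """Minimal sketch of a dual-buffer, chain-window turn handler.

    Speech segments are appended to a live buffer. The turn is only
    finalized if no new segment arrives within `chain_window` seconds;
    otherwise the new segment is chained onto the same turn.
    """

    def __init__(self, on_finalize, chain_window=1.5):
        self.on_finalize = on_finalize      # callback(full_text) when the turn ends
        self.chain_window = chain_window    # seconds to wait for more speech
        self._segments = []                 # live buffer of partial transcripts
        self._timer = None
        self._lock = threading.Lock()

    def add_segment(self, text):
        """Called each time the STT layer emits a chunk of speech."""
        with self._lock:
            # The user kept talking, so cancel the pending finalization;
            # this chunk belongs to the same turn.
            if self._timer is not None:
                self._timer.cancel()
            self._segments.append(text)
            # Arm a fresh timer; it only fires if the window expires quietly.
            self._timer = threading.Timer(self.chain_window, self._finalize)
            self._timer.start()

    def _finalize(self):
        with self._lock:
            full_text = " ".join(self._segments).strip()
            self._segments.clear()
            self._timer = None
        if full_text:
            self.on_finalize(full_text)


if __name__ == "__main__":
    buf = ChainedTurnBuffer(on_finalize=lambda t: print("TURN:", t))
    buf.add_segment("so I was thinking")
    time.sleep(0.8)   # pause shorter than the chain window -> same turn
    buf.add_segment("umm maybe we try the other approach")
    time.sleep(2.0)   # window expires quietly -> turn finalized once
```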

:test_tube: Results:

With that setup, “ummm” and hesitation gaps are handled way more gracefully. It even lets you correct yourself or extend the input stream, and the system responds like you’re just continuing the same sentence.

I also process each input in a separate thread, which helps when multiple inputs arrive at once: the reprocessing step runs in parallel, so the time to respond stays as short as possible.
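
Again, not my exact code, but that interrupt-and-regenerate step can be sketched with a standard thread pool where newer input supersedes any in-flight reply (the function names are just placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

executor = ThreadPoolExecutor(max_workers=4)   # replies generate in parallel
_lock = threading.Lock()
_generation = 0                                # bumped on every finalized turn

def generate_reply(text, gen):
    time.sleep(0.05)                 # stand-in for the actual model call
    reply = f"reply to: {text}"
    with _lock:
        # Only speak the reply if no newer input superseded this turn.
        if gen == _generation:
            print("SPEAK:", reply)

def on_finalized_turn(text):
    """Wire this up as the on_finalize callback of the buffer above."""
    global _generation
    with _lock:
        _generation += 1             # anything already in flight is now stale
        gen = _generation
        # In a real system you would also interrupt TTS playback here.
    executor.submit(generate_reply, text, gen)

if __name__ == "__main__":
    on_finalized_turn("so I was thinking")
    on_finalized_turn("so I was thinking, maybe we try the other approach")
    executor.shutdown(wait=True)     # the first reply is dropped as stale
```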

Hope that helps you and others make any of the input models work better.

I’m sure there are even better ways; perhaps when I swing back to my STT module I’ll come up with more ideas, but this concept has worked for me for many years.

Another thing you can do, if you have user profiles so that each user’s inputs are tracked, is build an ML model that learns a person’s speech patterns over time. That lets the system auto-adjust its parameters on the fly once it knows you’re a pause-thinker, for example. It does add a hardware/processing cost, but it’s another way to tune things.
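
I haven’t built the full ML version of this, but even a simple running statistic over a user’s pause lengths gets part of the way there; something like this (the numbers are purely illustrative):

```python
class UserPauseProfile:
    """Per-user pause statistics used to derive the chain window.

    A simple statistical stand-in for the ML model described above: an
    exponentially weighted moving average/variance of observed pause gaps.
    """

    def __init__(self, base_window=1.5, alpha=0.1):
        self.base_window = base_window   # never wait less than this (seconds)
        self.alpha = alpha               # how quickly we adapt to new pauses
        self.mean_gap = base_window      # running mean of pause lengths
        self.var_gap = 0.25              # running variance of pause lengths

    def observe_pause(self, gap_seconds):
        # Standard incremental EWMA update of mean and variance.
        diff = gap_seconds - self.mean_gap
        self.mean_gap += self.alpha * diff
        self.var_gap = (1 - self.alpha) * (self.var_gap + self.alpha * diff * diff)

    def chain_window(self):
        # Wait a bit longer than this user's typical pause,
        # but never less than the base window.
        return max(self.base_window, self.mean_gap + 2 * self.var_gap ** 0.5)


profile = UserPauseProfile()
for gap in (2.1, 1.8, 2.4):       # a "pause thinker"
    profile.observe_pause(gap)
print(profile.chain_window())     # grows past the 1.5 s default
```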

You can also create a pause detector where “hmm”, “umm”, etc. are picked up and simply extend the timer threshold.
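
That filler check can be as simple as looking at the tail of the latest segment before arming the timer (the filler list and multiplier are just examples):

```python
FILLERS = {"um", "umm", "ummm", "uh", "uhh", "hmm", "hmmm", "er", "err"}

def next_window(segment_text, base_window=1.5, extension_factor=2.0):
    """Return the chain window to arm after this speech segment.

    If the segment trails off with a filler word, the speaker is probably
    still thinking, so wait longer before finalizing the turn.
    """
    words = segment_text.strip().lower().rstrip(".,!?").split()
    if words and words[-1] in FILLERS:
        return base_window * extension_factor
    return base_window

print(next_window("so I was thinking umm"))   # -> 3.0 (extended)
print(next_window("that should be it."))      # -> 1.5 (normal)
```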

Cheers.