Realtime semantic VAD issues

I love the concept behind the semantic VAD mode, but in practice, it seems to cause the agent to stop talking mid-sentence a lot, even when there is no background noise. Anyone else seeing this?

Hey there and welcome to the community!

How sensitive is your mic? You may not think you hear noise, but if the mic's input volume is really high, or if it's close to something emitting noise, the model or VAD may still pick it up and interpret it as speech. There are also filters you could apply programmatically to clean up the audio before you send it to the API.
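To illustrate the "filter before you send" idea, here's a minimal sketch of an energy-based noise gate: frames below an RMS threshold are replaced with silence so the upstream VAD never sees low-level room noise. The function names and the threshold value are mine, not from any SDK, and a real setup would tune the threshold to the mic and room.

```python
import math
import struct

def rms(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def gate(frame: bytes, threshold: float = 500.0) -> bytes:
    """Pass the frame through if it is loud enough; otherwise emit
    digital silence of the same length, so frame timing is preserved."""
    return frame if rms(frame) >= threshold else b"\x00" * len(frame)
```

You'd run each captured frame through `gate()` before appending it to the input audio buffer; start with a low threshold and raise it until background noise stops triggering interruptions.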

I’m using OSS models right now, but I have noticed that my microphone is sometimes too good, or my filter is too aggressive, so things are either missed or too much is picked up. So far it is literally just a game of dialing knobs and experimenting to see what works best, both in the environment and with the input device being used. It’s going to be different for everyone.

Yeah, fair point. It DOES act like it’s hearing a noise. I guess that didn’t seem like the explanation though, because I’ve been using the non-semantic version for a while now, and it’s much less sensitive to interruptions. Plus, in semantic mode, there aren’t any settings for sensitivity, are there?
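For what it's worth, the Realtime session docs I've seen do list one knob for semantic_vad, an `eagerness` setting, while the numeric sensitivity controls (threshold, padding, silence duration) appear to be server_vad-only. A sketch of the two `session.update` payloads, with parameter names as I understand them (double-check against the current API reference before relying on this):

```python
# Hedged sketch: field names are taken from the Realtime API session docs
# as I understand them; verify against the current reference.
server_vad = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.7,            # higher = less sensitive to noise
            "prefix_padding_ms": 300,    # audio kept before detected speech
            "silence_duration_ms": 500,  # silence needed to end the turn
        }
    },
}

semantic_vad = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",
            "eagerness": "low",  # "low" waits longer before ending a turn
        }
    },
}
```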


It is a cool concept, but I also fell back to server_vad and added more local control to manage interruptions. Feels like an area to watch; I’ll switch back once the behavior aligns with my use case (an elder-care smart speaker).


@mcfinley I might have to switch back to server_vad as well. Do you have any specific tips that worked for you?

@tleyden the short version is local DSP for VAD… check out OpenWakeWord on GitHub… you can see my implementation in open source on chattyfriend dot com.
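To sketch what "local DSP for VAD" looks like in practice (note OpenWakeWord itself is a wake-word detector; the VAD step is usually a separate local model such as webrtcvad or Silero): the client runs its own speech detector and only forwards audio upstream while speech is active, with a short hangover so word endings aren't clipped. This sketch uses a pluggable `is_speech` callback so it isn't tied to any particular VAD library; the names and the hangover length are my own.

```python
from typing import Callable, Iterable, Iterator

def forward_speech(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    hangover: int = 10,
) -> Iterator[bytes]:
    """Yield only frames during (or just after) detected speech.

    `hangover` is how many trailing frames to keep forwarding after the
    last voiced frame, so the tail of an utterance isn't cut off.
    """
    remaining = 0
    for frame in frames:
        if is_speech(frame):
            remaining = hangover  # refresh the hangover on every voiced frame
        if remaining > 0:
            remaining -= 1
            yield frame
```

In a real pipeline, `is_speech` would wrap whatever local VAD you picked, and everything this generator yields is what gets sent to the Realtime API, so the server-side VAD only ever hears candidate speech.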

Oh wow, in your earlier message you said server_vad, but if I’m understanding correctly you actually had to abandon that and go to client-side VAD? Will check out those links, thanks!