I am using the Realtime API connected to Twilio phone calls via WebSockets. It works well in most cases, but the VAD often fails to detect very short words such as “yes” or “yep” when a caller speaks quickly and says only that one word. The agent then keeps waiting for more input, which results in long pauses.
I have experimented with adjusting the server_vad settings (lowering the threshold seems to be the most relevant setting) and using semantic_vad at high eagerness, but I haven’t had much luck improving recognition of these short words. Has anyone come up with good ways to handle this particular issue?
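For context, here is a rough sketch of the two kinds of `session.update` payloads described above. The numeric values are illustrative assumptions, not settings that are known to fix the problem. The sketch also assumes the `session.turn_detection` layout; some API versions may nest turn detection elsewhere in the session object, so check the reference for the version in use. The helper names are hypothetical.

```python
import json


def server_vad_update(threshold=0.3, prefix_padding_ms=300, silence_duration_ms=400):
    """Build a session.update event that selects server_vad.

    The defaults here are illustrative guesses, not recommendations.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # Lower values make the detector more sensitive to quiet or brief speech.
                "threshold": threshold,
                # Audio kept from just before speech was detected.
                "prefix_padding_ms": prefix_padding_ms,
                # Silence required before the turn is considered finished.
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }


def semantic_vad_update(eagerness="high"):
    """Build a session.update event that selects semantic_vad."""
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": eagerness,
            }
        },
    }


if __name__ == "__main__":
    # Each payload would be sent as a JSON text frame on the Realtime WebSocket.
    print(json.dumps(server_vad_update(), indent=2))
    print(json.dumps(semantic_vad_update(), indent=2))
```

Either payload would be sent once after the session is created, and again whenever the settings change mid-call.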
The fallback is to detect a long enough pause and ask “Hey, are you still there?”, but it would be nice not to have to resort to this if possible.
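The pause-detection fallback can be sketched as a small watchdog that is independent of the VAD. This is a minimal illustration, not a definitive implementation: the `SilenceWatchdog` class, the 6-second timeout, and the idea of calling `note_activity()` whenever caller audio or an agent response is observed are all assumptions made for the example.

```python
import time


class SilenceWatchdog:
    """Tracks how long the call has been quiet and decides when to prompt.

    Hypothetical helper: the timeout and how activity is reported are assumptions.
    """

    def __init__(self, timeout_s=6.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_activity = clock()
        self._prompted = False

    def note_activity(self):
        """Call when the caller speaks or the agent starts responding."""
        self._last_activity = self._clock()
        self._prompted = False

    def should_prompt(self):
        """Return True once per quiet period, after the timeout has elapsed."""
        if self._prompted:
            return False
        if self._clock() - self._last_activity >= self._timeout_s:
            self._prompted = True
            return True
        return False
```

A polling loop or timer would call `should_prompt()` periodically and, when it returns True, have the agent ask whether the caller is still there. Prompting only once per quiet period avoids repeating the question while the caller stays silent.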