Realtime API recognizing very short words - WebSockets

I am using the Realtime API connected to Twilio phone calls via WebSockets. It works well in most cases, but the VAD often fails to detect very short words such as “yes” or “yep” when people speak quickly and say only one word. The agent then keeps waiting for more input, which results in long pauses.

I have experimented with adjusting the server_vad settings (lowering the threshold seems to be the most relevant setting) and using semantic_vad at high eagerness, but I haven’t had much luck improving recognition of these short words. Has anyone come up with good ways to handle this particular issue?
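For context, here is a rough sketch of what adjusting these settings can look like over the WebSocket. It assumes the `session.update` event shape with a `turn_detection` object directly under `session`; newer API versions may nest this differently, so check it against the current reference. The helper names and numeric values are illustrative placeholders, not the values tried in this thread.

```javascript
// Sketch: build session.update events that tune turn detection.
// Helper names and default values are assumptions for illustration only.
function buildServerVadUpdate({
  threshold = 0.3,          // lower = more sensitive to quiet or short speech
  prefixPaddingMs = 300,    // audio kept from before speech was detected
  silenceDurationMs = 300,  // silence required before the turn is considered over
} = {}) {
  return {
    type: 'session.update',
    session: {
      turn_detection: {
        type: 'server_vad',
        threshold,
        prefix_padding_ms: prefixPaddingMs,
        silence_duration_ms: silenceDurationMs,
      },
    },
  };
}

function buildSemanticVadUpdate(eagerness = 'high') {
  return {
    type: 'session.update',
    session: { turn_detection: { type: 'semantic_vad', eagerness } },
  };
}
```

Either event would be sent as `openaiSocket.send(JSON.stringify(buildServerVadUpdate()))` once the socket is open.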

The fallback behavior is to detect a long enough pause and ask “Hey, are you still there?”, but it would be nice not to have to resort to this if possible.


This is how I do it:
// This timer runs every 200 ms to catch short utterances like "yes" or "no".
// lastAudioTime, audioFrameBatch, hasCommitted, flushAudioBatch, and openaiSocket are defined elsewhere.
const autoCommitTimer = setInterval(() => {
  const elapsedSinceAudio = Date.now() - lastAudioTime;
  // Audio is buffered and no new audio has arrived for 500 ms
  if (audioFrameBatch.length > 0 && elapsedSinceAudio > 500) {
    flushAudioBatch();
    if (!hasCommitted && openaiSocket.readyState === WebSocket.OPEN) {
      openaiSocket.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
      hasCommitted = true;
      console.log('🔄 Auto-committed short user utterance due to silence');
    }
  }
}, 200);
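The snippet above depends on state defined elsewhere, so here is a self-contained sketch of the same idea that can be tested in isolation. `createAutoCommitter`, `onAudioFrame`, and `tick` are hypothetical names, and the clock and send function are injected. It also sends `response.create` after the commit, on the assumption that automatic turn detection is disabled; committing the buffer does not by itself produce a response. If your stream delivers frames continuously even during silence, `onAudioFrame` should only be called for frames you judge to contain speech.

```javascript
// Sketch: commit buffered audio after a period with no new speech frames.
// All names here are illustrative assumptions, not part of any library.
function createAutoCommitter({ send, now = Date.now, silenceMs = 500 }) {
  let pendingFrames = [];
  let lastAudioTime = now();

  return {
    // Call with each base64-encoded audio chunk that counts as speech.
    onAudioFrame(base64Audio) {
      pendingFrames.push(base64Audio);
      lastAudioTime = now();
    },

    // Call periodically, e.g. from setInterval. Returns true if a commit was sent.
    tick() {
      const quietFor = now() - lastAudioTime;
      if (pendingFrames.length === 0 || quietFor <= silenceMs) return false;

      // Flush whatever is buffered, then close the turn manually.
      for (const audio of pendingFrames) {
        send({ type: 'input_audio_buffer.append', audio });
      }
      pendingFrames = [];
      send({ type: 'input_audio_buffer.commit' });
      // Assumption: turn detection is off, so a response must be requested explicitly.
      send({ type: 'response.create' });
      return true;
    },
  };
}
```

Wiring it up could look like `const committer = createAutoCommitter({ send: (e) => openaiSocket.send(JSON.stringify(e)) }); setInterval(() => committer.tick(), 200);`.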


I am having the same kind of issue. Did you find a solution, @melindacr?