Realtime API interruptions are far too sensitive even at a high VAD threshold value

mohit7 · January 21, 2025, 6:53pm

Hey Folks,

I've been playing around with the new Realtime API and am having a lot of trouble getting it to pick up only on relevant interruptions. As you can imagine, a normal conversation contains a lot of filler words like ‘hmm’, ‘umm’ or even the occasional ‘okay…’ or ‘right…’, which are not meant as interruptions but as acknowledgements that the user is following what the other party is saying.

The Realtime API, though, treats these as actual interruptions: it stops speaking and takes the acknowledgement as a cue to move on to the next part of the conversation. To make matters worse, as another post points out, the API does not properly trim the transcript either, so the AI assumes the user heard everything it was going to say even though it was cut off, which leads to a completely disjointed experience.
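On the transcript point, here is a rough sketch of the client-side truncation message, assuming the `conversation.item.truncate` client event with its `item_id`, `content_index` and `audio_end_ms` fields. The helper name `build_truncate_event` and the example item ID are hypothetical, and the client is assumed to track for itself how many milliseconds of the assistant's audio were actually played before the interruption.

```python
import json


def build_truncate_event(item_id: str, played_ms: float, content_index: int = 0) -> dict:
    """Build a conversation.item.truncate event for an interrupted assistant item.

    item_id       -- ID of the assistant message that was cut off
    played_ms     -- audio the user actually heard, in milliseconds (tracked client-side)
    content_index -- index of the audio content part within the item
    """
    if played_ms < 0:
        raise ValueError("played_ms must be non-negative")
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        # The event expects an integer number of milliseconds.
        "audio_end_ms": int(played_ms),
    }


if __name__ == "__main__":
    # "item_123" is a placeholder, not a real item ID.
    print(json.dumps(build_truncate_event("item_123", 1520.7), indent=2))
```

The resulting dict would be serialized and sent over the existing WebSocket or data channel as soon as playback is stopped, so that the server-side conversation only contains what the user really heard.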

I have tried all kinds of values for vad_threshold, prefix_padding and silence_duration, but to no avail.
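For reference, a minimal sketch of the kind of `session.update` payload these settings go into, assuming the server VAD field names `threshold`, `prefix_padding_ms` and `silence_duration_ms` under `turn_detection` (the names used in the post are shorthand). The helper name `build_vad_session_update` and the default values are illustrative only, not recommended settings.

```python
import json


def build_vad_session_update(
    threshold: float = 0.9,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 800,
) -> dict:
    """Build a session.update event that configures server-side VAD.

    threshold           -- activation threshold between 0.0 and 1.0; higher needs louder speech
    prefix_padding_ms   -- audio kept from before detected speech starts
    silence_duration_ms -- silence required before the turn is considered finished
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_vad_session_update(), indent=2))
```

Note that these settings only act on audio energy and timing, which is why raising the threshold alone may not separate a short ‘hmm’ from a genuine interruption.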

Anyone have any ideas around this?

FrostByte · January 24, 2025, 3:59am

I think this article might be helpful to you: Improving voice AI's turn detection with transformers
