Realtime API interruptions are far too sensitive even at a high VAD threshold value

Hey Folks,

Been playing around with the new Realtime API and having a lot of trouble getting it to pick up only on relevant interruptions. As you can imagine, a normal conversation typically contains a lot of filler words like ‘hmm’, ‘umm’ or even the occasional ‘okay…’ or ‘right…’, which are not meant to be interruptions but rather acknowledgements from the user that they understand what the other party is saying.

The Realtime API, though, treats these as actual interruptions and takes each one as an acknowledgement and a cue to move on to the next part of the conversation. To make matters worse, as another post points out, the API does not properly trim the transcript either, so the AI thinks the user has context about what was said even after it was interrupted, leading to a completely disjointed experience.
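For anyone unfamiliar with the trimming part: as far as I understand it, the client is expected to send a `conversation.item.truncate` event telling the server how much of the assistant's audio was actually played before the cut-off. Here is a rough sketch of building that event, assuming 24 kHz mono PCM16 output audio and a client that tracks how many bytes it has played. `played_ms`, `build_truncate_event` and the item id are hypothetical names for illustration, and the event fields should be checked against the current API reference.

```python
import json

SAMPLE_RATE_HZ = 24_000  # assumption: pcm16 output audio at 24 kHz
BYTES_PER_SAMPLE = 2     # 16-bit mono samples


def played_ms(bytes_played: int) -> int:
    """Convert the number of PCM16 bytes actually played into milliseconds."""
    samples = bytes_played // BYTES_PER_SAMPLE
    return samples * 1000 // SAMPLE_RATE_HZ


def build_truncate_event(item_id: str, bytes_played: int) -> dict:
    """Build a truncate event for an assistant item that was cut off mid-playback."""
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,       # id of the interrupted assistant message
        "content_index": 0,       # assumption: audio is the first content part
        "audio_end_ms": played_ms(bytes_played),
    }


# Hypothetical example: 96,000 bytes played before the user cut in.
event = build_truncate_event("item_123", bytes_played=96_000)
print(json.dumps(event))
```

The serialized event would be sent over the same WebSocket as the rest of the session traffic. The point is that `audio_end_ms` comes from what the user actually heard, not from what the model generated.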

I have tried all kinds of values for vad_threshold, prefix_padding and silence_duration but to no avail.
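For context, these settings live under `turn_detection` in a `session.update` event, where (as far as I can tell from the docs) they are named `threshold`, `prefix_padding_ms` and `silence_duration_ms`. A minimal sketch of the payload follows. `build_session_update` is a hypothetical helper and the numbers are placeholder values for illustration, not a recommendation or the exact values I tested.

```python
import json


def build_session_update(threshold: float,
                         prefix_padding_ms: int,
                         silence_duration_ms: int) -> dict:
    """Build a session.update event that configures server-side VAD."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is expected to be between 0.0 and 1.0")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,                      # higher = needs louder speech
                "prefix_padding_ms": prefix_padding_ms,      # audio kept before detected speech
                "silence_duration_ms": silence_duration_ms,  # silence that ends a turn
            }
        },
    }


# Placeholder values only, chosen to show a "high threshold" configuration.
update = build_session_update(threshold=0.9,
                              prefix_padding_ms=300,
                              silence_duration_ms=800)
print(json.dumps(update, indent=2))
```

Note that all three knobs act on audio energy and timing, so none of them can tell a quiet ‘hmm’ from the start of a real interruption, which is consistent with what I am seeing.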

Anyone have any ideas around this?

I think this article might be helpful to you: Improving voice AI's turn detection with transformers