Hi everyone,
I’m currently building a real-time voice AI agent using the OpenAI Realtime API (gpt-realtime), and I’m facing a major issue with interruptions.
In natural conversations, users often say filler words like “hmm”, “uh-huh”, “ok”, or “right” just to acknowledge they are listening. However, the Realtime API seems to treat these as actual interruptions.
As a result:
- The assistant stops speaking immediately
- It assumes the user wants to take over the conversation
- This breaks the flow and creates a very unnatural experience
In many cases, the user is not trying to interrupt at all; they are just passively acknowledging.
I’ve already experimented with multiple parameters such as:
- vad_threshold
- prefix_padding
- silence_duration
But none of these seem to reliably solve the issue.
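For context, here is a minimal sketch of the kind of `session.update` event these settings go into. The values are illustrative placeholders rather than a recommendation, the helper name `build_turn_detection_update` is hypothetical, and I'm assuming the flat `session.turn_detection` shape with `server_vad`; the exact nesting may differ between API versions.

```python
import json


def build_turn_detection_update(threshold=0.5, prefix_padding_ms=300,
                                silence_duration_ms=500):
    """Build a session.update event carrying server-VAD settings.

    Assumption: the turn_detection object sits directly under "session".
    Newer API versions may nest it differently.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # Higher threshold = louder audio needed to count as speech.
                "threshold": threshold,
                # Audio kept from before the detected speech start.
                "prefix_padding_ms": prefix_padding_ms,
                # Silence required before the turn is considered finished.
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }


# Placeholder values only; these would be sent as JSON over the session socket.
event = build_turn_detection_update(threshold=0.8, silence_duration_ms=700)
print(json.dumps(event, indent=2))
```

Raising the threshold filters out quiet sounds, but a clearly spoken "ok" is still speech as far as VAD is concerned, which is probably why tuning these values alone doesn't separate acknowledgments from interruptions.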
Another related problem is that when interruptions happen, the transcript/context handling is not always consistent. The assistant sometimes behaves as if the user heard the full message, even though it was cut mid-way, which leads to confusing or disjointed responses.
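As far as I understand, the intended way to keep the context in sync is for the client to send a `conversation.item.truncate` event saying how much of the assistant audio was actually played. A rough sketch of building that event is below; the helper name `build_truncate_event` is hypothetical, and `played_ms` is assumed to come from the client's own playback tracking.

```python
def build_truncate_event(item_id, played_ms, content_index=0):
    """Build a conversation.item.truncate event for an interrupted reply.

    item_id:   ID of the assistant message that was cut off.
    played_ms: how many milliseconds of its audio the user actually heard
               (assumption: tracked by the client's audio player).
    """
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        # Clamp to a non-negative integer number of milliseconds.
        "audio_end_ms": max(0, int(played_ms)),
    }


# "item_123" is a made-up ID used only for illustration.
truncate = build_truncate_event("item_123", played_ms=1840.6)
```

If `played_ms` is based on how much audio was received rather than how much was played, the context will still look as if the user heard more than they did.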
What I’m looking for:
- A way to distinguish between acknowledgment sounds vs real interruptions
- Better control over interruption sensitivity
- Best practices for tuning turn detection in real-world conversations
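To make the first point concrete, this is roughly the kind of distinction I mean, written as a hypothetical client-side heuristic over the transcript of the user's utterance. Nothing here is part of the Realtime API: the word list, the `max_words` cutoff, and the function names are all my own assumptions.

```python
import re

# Hypothetical list of short acknowledgments; would need tuning per language.
BACKCHANNELS = {
    "hmm", "mhm", "mm", "mm hmm", "uh huh", "ok", "okay",
    "right", "yeah", "yes", "i see", "got it",
}


def normalize(text):
    """Lowercase, turn hyphens into spaces, drop punctuation, squeeze spaces."""
    text = text.lower().replace("-", " ")
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(text.split())


def is_backchannel(transcript, max_words=3):
    """Return True if the utterance looks like a passive acknowledgment."""
    cleaned = normalize(transcript)
    if not cleaned:
        return True  # empty or noise-only transcript: not a real interruption
    if len(cleaned.split()) > max_words:
        return False  # longer utterances are treated as real turns
    return cleaned in BACKCHANNELS


for phrase in ["Uh-huh.", "ok", "Wait, can you repeat that?", "stop"]:
    print(repr(phrase), is_backchannel(phrase))
```

The catch is that a check like this only works once a transcript exists, and by then VAD may already have cut the assistant off, so it would only help if the automatic interruption can be delayed or undone.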
Has anyone found a reliable workaround for this?
Or is there an official recommendation for handling this kind of scenario?
Thanks in advance!