Realtime API Server turn detection limitations (Suggestion & Help Request)

I am excited about the potential of voice-to-voice AI conversations, so I jumped on the Realtime API pretty quickly. After testing it a fair bit, I am struggling with the server turn detection, and I think there may be a fundamental limitation in the approach that makes it unworkable.

The issue is that the API responds to the user before they are done speaking or have finished their thought.

There are cases where a user will say something, and then pause to think, even for a few seconds, and I want the app to respect that they are thinking and wait for them to finish before it responds. I don’t want to apply a universal rule to always wait the same amount of time in every situation.

I see I have access to three settings, “threshold”, “prefix padding”, and “silence duration”, with “silence duration” being the most relevant.
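For reference, here is roughly what those settings look like in a session.update event (I believe the actual field names are threshold, prefix_padding_ms, and silence_duration_ms; the values below are just illustrative, not recommendations):

```python
# Rough sketch of the server-VAD turn_detection block in a session.update
# event; field names follow the docs, values are just examples.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # speech-probability threshold
            "prefix_padding_ms": 300,    # audio kept from before speech starts
            "silence_duration_ms": 500,  # silence required before a response
        }
    },
}
# Sent as JSON over the Realtime API WebSocket,
# e.g. ws.send(json.dumps(session_update))
```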

I think the solution is that the API needs to be able to interpret the user’s last message and decide not to respond to it, or at least wait longer.

I propose splitting the silence duration into two values (minimum & maximum):

  • The model waits for the minimum silence duration before it considers responding.
  • It then estimates the likelihood that the user will say more; if it thinks the user is done, it responds.
  • If it thinks the user isn’t done, it waits up to the maximum silence duration before it responds.

I would probably set the minimum silence duration to around 300ms and the maximum silence duration to around 5s as a gut shot.
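To make that concrete, here is a rough sketch of the logic I have in mind; estimate_more_speech_likely is a hypothetical stand-in for whatever ends up judging whether the user is done:

```python
MIN_SILENCE_S = 0.3   # minimum silence before we even consider responding
MAX_SILENCE_S = 5.0   # hard cap: respond after this much silence regardless

def estimate_more_speech_likely(transcript_so_far: str) -> bool:
    """Hypothetical classifier: True if the user probably isn't finished."""
    raise NotImplementedError

def should_respond(silence_s: float, transcript_so_far: str) -> bool:
    if silence_s < MIN_SILENCE_S:
        return False                 # user only just went quiet
    if silence_s >= MAX_SILENCE_S:
        return True                  # waited long enough either way
    # Between the two bounds: only respond if the user seems done.
    return not estimate_more_speech_likely(transcript_so_far)
```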

Has anyone been able to solve this with prompting? Because from my tests it seems no matter what I do the API will always respond even if I explicitly ask it not to.

7 Likes

Here’s a potential solution.

  1. Disable server VAD
  2. Use browser client-side VAD — https://www.vad.ricky0123.com/
  3. Implement your own VAD logic with the minimum & maximum silence duration
  4. For the likelihood detection, you would probably need to run a separate STT service (e.g. Whisper) and feed the transcript to a smaller, faster model like gpt-4o-mini to decide whether the user is done (rough sketch below); you could also go as far as training / fine-tuning a much smaller model for that
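For step 4, here is a rough sketch using the OpenAI Python SDK; the prompt wording and the is_turn_complete wrapper are just my own guesses at how you might frame the classification:

```python
# Sketch of step 4: transcribe the buffered audio, then ask a small model
# whether the utterance looks finished. Prompt wording is just an example.
from openai import OpenAI

client = OpenAI()

def is_turn_complete(audio_path: str) -> bool:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=open(audio_path, "rb"),
    ).text

    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Reply with only YES or NO: does this utterance "
                        "sound like a finished thought, or is the speaker "
                        "likely to continue? YES = finished."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content.strip().upper()

    return answer.startswith("YES")
```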
3 Likes

+1

This is the most important thing to solve in conversational AI.

There is nothing more frustrating than not letting the speaker finish their sentence. Gemini is even worse than OpenAI at this.
I hope OpenAI will implement it.

The clue is clearly in the speaker’s intonation. We finish questions with a different tone of voice than when we are just pausing to formulate a thought.

A workaround might be to allow the AI to be interrupted by speech even while it is replying. But instead of ignoring the previous question (like Gemini does, which is even more frustrating), it should merge the previous and new questions together and reply to them as if they were one.

This is the way. However, voice activity detection primarily involves identifying sounds that exhibit characteristics typical of actual human speech - you don’t need an AI, just a DSP algorithm.

For example, the webrtcvad library offers the essential feature of performing binary speech/non-speech detection on 10-30 ms slices of audio. That detector can continuously monitor a FIFO buffer, keeping a rolling window of per-frame decisions and comparing it against tunable certainty levels. This is useful for several purposes: detecting the start of speech, identifying prolonged silence, and recognizing speech interruptions, with each scenario using a differently tuned rule to optimize detection sensitivity. Additionally, the level of responsiveness or patience can be exposed as a user interface control.
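A minimal sketch of that idea in Python (the frame size, aggressiveness mode, and the voiced-frame ratios are just one way to tune it):

```python
# Binary speech/non-speech decisions with webrtcvad on 20 ms frames of
# 16 kHz, 16-bit mono PCM, smoothed over a rolling window.
# Window length and the 90% / 10% trigger ratios are tuning examples.
import collections
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0-3
SAMPLE_RATE = 16000
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples

window = collections.deque(maxlen=25)  # ~500 ms of per-frame decisions
speaking = False

def feed_frame(frame: bytes) -> None:
    """Call with each 20 ms PCM frame pulled from the FIFO buffer."""
    global speaking
    window.append(vad.is_speech(frame, SAMPLE_RATE))
    voiced_ratio = sum(window) / len(window)
    if not speaking and voiced_ratio > 0.9:
        speaking = True     # start of speech detected
    elif speaking and voiced_ratio < 0.1:
        speaking = False    # prolonged silence detected
```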

The API service aims to deliver rapid responses, which can inadvertently create a sense of urgency and anxiety, preventing the user from fully forming their thoughts and compelling them to keep speaking, often resorting to discourse fillers.

For better chat, we will have to wait for a next-generation AI that is actually context-aware of what is being spoken, live, the way we are in natural conversations. A human listener doesn’t need to hear us ramble on for three minutes, and even inserts little “uh-huh” reinforcements as listening confirmation. I’m sure you can think of tricks to discriminate what signals a desire to interrupt…

1 Like

The problem is likely that the model does not “decide” on the appropriate moment to respond, at least not in the way it is currently implemented. While the audio is fed directly into the model, the output generation is still triggered externally. Otherwise you would not even need those VAD parameters.

One idea could be to let the model actually “decide” by offering a tool that it can call (something like “wait_for_user_to_finish_speaking”). But I am skeptical that this will actually work, since the model seems to be trained to always output voice in addition to calling a function. Still, it may be worth a try.
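If anyone wants to try it, a tool along these lines could be declared in the session config; the name and description are made up, and whether the model would actually call it instead of speaking is exactly the open question:

```python
# Hypothetical tool definition for the Realtime API session config.
# Whether the model calls it rather than replying is what remains in doubt.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "wait_for_user_to_finish_speaking",
                "description": (
                    "Call this instead of replying when the user has paused "
                    "but does not appear to have finished their thought."
                ),
                "parameters": {"type": "object", "properties": {}},
            }
        ],
        "tool_choice": "auto",
    },
}
```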

1 Like