Realtime API Server turn detection limitations (Suggestion & Help Request)

Here’s a potential solution.

  1. Disable server VAD
  2. Use browser client-side VAD — https://www.vad.ricky0123.com/
  3. Implement your own VAD logic with minimum and maximum silence durations
  4. For the likelihood detection, you would probably need to run a separate STT service (e.g. Whisper) and feed its transcript to a smaller, faster model like gpt-4o-mini to do the detection (you could even go as far as training or fine-tuning a much smaller model for that)
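
To make steps 1, 3 and 4 a bit more concrete, here is a rough sketch of the decision logic, not a tested integration. Everything named here (`TurnDetector`, `decide`, `likelyFinished`, the millisecond thresholds) is made up for illustration, and I'm assuming "likelihood" means how likely it is that the speaker has finished their turn. The event payloads at the bottom reflect my understanding of the Realtime API (`session.update` with `turn_detection: null`, then `input_audio_buffer.commit` and `response.create` sent manually), so please check them against the current docs before relying on them.

```typescript
// Hypothetical sketch: names and thresholds are illustrative only.
type TurnDecision = "wait" | "end_of_turn";

interface TurnDetectorOptions {
  minSilenceMs: number; // never end a turn before this much silence
  maxSilenceMs: number; // always end a turn once silence reaches this
}

class TurnDetector {
  private readonly opts: TurnDetectorOptions;
  private speaking = false;
  private silenceStartedAt: number | null = null;

  constructor(opts: TurnDetectorOptions) {
    if (opts.minSilenceMs > opts.maxSilenceMs) {
      throw new Error("minSilenceMs must not exceed maxSilenceMs");
    }
    this.opts = opts;
  }

  // Call from the client-side VAD's "speech started" callback.
  onSpeechStart(): void {
    this.speaking = true;
    this.silenceStartedAt = null;
  }

  // Call from the client-side VAD's "speech ended" callback.
  onSpeechEnd(nowMs: number): void {
    this.speaking = false;
    this.silenceStartedAt = nowMs;
  }

  // likelyFinished is the verdict of the separate STT + small-model check (step 4).
  decide(nowMs: number, likelyFinished: boolean): TurnDecision {
    if (this.speaking || this.silenceStartedAt === null) return "wait";
    const silenceMs = nowMs - this.silenceStartedAt;
    if (silenceMs < this.opts.minSilenceMs) return "wait";
    if (silenceMs >= this.opts.maxSilenceMs) return "end_of_turn";
    // Between min and max, defer to the likelihood check.
    return likelyFinished ? "end_of_turn" : "wait";
  }
}

// Assumed Realtime API payloads (verify against the current docs).
// Step 1: turn off server VAD for the session.
const disableServerVad = {
  type: "session.update",
  session: { turn_detection: null },
};

// With server VAD off, the client ends the turn itself.
function manualTurnEvents(): { type: string }[] {
  return [{ type: "input_audio_buffer.commit" }, { type: "response.create" }];
}
```

The idea is that `onSpeechStart` / `onSpeechEnd` get wired to whatever callbacks your browser VAD exposes, `decide` is polled while the user is silent, and when it returns `"end_of_turn"` you send the events from `manualTurnEvents()` over your Realtime connection. The minimum keeps short pauses from cutting the user off, and the maximum guarantees a response even if the likelihood check keeps saying "not done".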