Here’s a potential solution.
- Disable server VAD
- Use browser client-side VAD: https://www.vad.ricky0123.com/
- Implement your own VAD logic with a minimum and a maximum silence duration
- For the likelihood detection, you would probably need to run a separate STT service (e.g. Whisper) and pass its transcript to a smaller, faster model like gpt-4o-mini to do the detection (you can also go as far as training / fine-tuning a much smaller model for that)
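
To make the minimum / maximum silence idea concrete, here is a rough sketch of how that logic could look. It reflects one possible reading of the two thresholds: after the minimum silence you ask the likelihood check whether the speaker is done, and after the maximum silence you end the turn regardless. The class name `SilenceTurnDetector`, the `TurnDecision` values, and the example thresholds are all hypothetical, and the speech / no-speech flag is assumed to come from whatever client-side VAD you use.

```typescript
// Hypothetical sketch: names and thresholds are illustrative, not from any library.
type TurnDecision = "wait" | "check" | "commit";

class SilenceTurnDetector {
  private minSilenceMs: number;
  private maxSilenceMs: number;
  private lastSpeechMs: number | null = null; // timestamp of the last speech frame
  private checked = false; // whether "check" was already emitted for this pause

  constructor(minSilenceMs: number, maxSilenceMs: number) {
    this.minSilenceMs = minSilenceMs;
    this.maxSilenceMs = maxSilenceMs;
  }

  // Call once per audio frame with the client-side VAD's speech / no-speech result.
  update(isSpeech: boolean, nowMs: number): TurnDecision {
    if (isSpeech) {
      // Speech resets the pause, so a later pause can trigger a new check.
      this.lastSpeechMs = nowMs;
      this.checked = false;
      return "wait";
    }
    if (this.lastSpeechMs === null) {
      return "wait"; // nothing has been said yet in this turn
    }
    const silenceMs = nowMs - this.lastSpeechMs;
    if (silenceMs >= this.maxSilenceMs) {
      // Pause is long enough to end the turn without asking the model.
      this.lastSpeechMs = null;
      this.checked = false;
      return "commit";
    }
    if (silenceMs >= this.minSilenceMs && !this.checked) {
      // Pause is long enough to be worth one likelihood check.
      this.checked = true;
      return "check";
    }
    return "wait";
  }
}
```

On `"check"` you would send the current transcript to the likelihood model and end the turn early if it says the utterance looks complete; on `"commit"` you end the turn unconditionally. Something like `new SilenceTurnDetector(500, 2000)` is only a starting point, so tune both values against real conversations.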