I’m using OpenAI Realtime API (gpt-realtime) + Twilio Media Streams to build a voice bot.
To stop false triggers from coughs or background noise, I enabled server VAD and increased the silence window:
```json
"turn_detection": {
  "type": "server_vad",
  "prefix_padding_ms": 300,
  "silence_duration_ms": 1000,
  "threshold": 0.9
}
```
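For context, this is roughly how I apply it, a simplified sketch of the `session.update` event I send right after the Realtime WebSocket connects (the `input_audio_format` value assumes Twilio's 8 kHz mu-law stream):

```python
import json

# Simplified sketch of the session.update event sent after the
# Realtime WebSocket connects (field names per the Realtime API).
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "prefix_padding_ms": 300,
            "silence_duration_ms": 1000,
            "threshold": 0.9,
        },
        # Twilio Media Streams delivers 8 kHz G.711 mu-law audio
        "input_audio_format": "g711_ulaw",
    },
}

payload = json.dumps(session_update)
# In the FastAPI handler: await openai_ws.send(payload)
print(payload)
```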
This fixed the cough/ambiguous-sound problem, but now the bot completely ignores short utterances like “hello”, “yes”, and “no”.
So I'm stuck with a trade-off:

- If I increase the VAD delay → short real speech is ignored
- If I decrease the delay → coughs/breathing trigger a response
Question:
How can I ignore noise/coughs but still detect short valid speech in real time?
Is there a better way to handle this? (Custom VAD, buffering, phoneme detection, etc.)
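To illustrate what I mean by a custom VAD / buffering approach: a simple energy gate in front of the Realtime stream that buffers frames and only forwards them once several consecutive frames exceed an RMS threshold, so an isolated cough-like burst never reaches the model but a short word does. This is a rough sketch with hypothetical, untuned thresholds, not something I've validated:

```python
import array

FRAME_MS = 20
RMS_THRESHOLD = 500    # hypothetical; would need tuning for a phone line
MIN_VOICED_FRAMES = 3  # ~60 ms of sustained energy before forwarding

def rms(frame: array.array) -> float:
    """Root-mean-square energy of one 16-bit PCM frame."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

class EnergyGate:
    """Buffers frames and only 'opens' after sustained energy, so a
    one-frame cough burst is dropped but a short word gets through."""

    def __init__(self) -> None:
        self.voiced_run = 0
        self.pending: list[array.array] = []
        self.open = False

    def feed(self, frame: array.array) -> list[array.array]:
        """Returns the frames to forward to the Realtime API (possibly none)."""
        if self.open:
            return [frame]
        if rms(frame) >= RMS_THRESHOLD:
            self.voiced_run += 1
            self.pending.append(frame)
            if self.voiced_run >= MIN_VOICED_FRAMES:
                self.open = True
                out, self.pending = self.pending, []
                return out  # flush the buffer so the word onset isn't lost
        else:
            # Energy dropped before the run completed: treat it as noise.
            self.voiced_run = 0
            self.pending.clear()
        return []
```

(Re-closing the gate on silence is omitted for brevity.) The idea is that the Realtime API's own server VAD could then stay at a lower `silence_duration_ms`, since coughs never reach it.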
Context / Stack
Twilio → FastAPI WebSocket → OpenAI Realtime (Audio)
Using the "gpt-realtime" model with streaming input/output