I’m building a phone bot using Twilio and OpenAI’s Realtime API that allows users to call and speak with an AI assistant about store information. I’m running into some challenges with audio processing and would appreciate any guidance from the community.
Current Implementation
Using Twilio for phone integration
OpenAI Realtime API with GPT-4
Server-side VAD for turn detection, with a configurable threshold
g711_ulaw audio format
Issues
Background Noise Sensitivity: The system is picking up ambient voices and background noise, even with high VAD thresholds
Self-feedback Loop: The assistant sometimes picks up its own voice output and responds to it, particularly when users are on speakerphone
Speakerphone Compatibility: These issues are especially pronounced during speakerphone usage, which is a crucial use case for our implementation
What We’ve Tried
Increased VAD threshold settings (up to 0.9) in the session configuration:
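What we send looks roughly like this (paraphrased from memory; `openai_ws` is our websockets connection to the Realtime API):

```python
import json

async def raise_vad_threshold(openai_ws) -> None:
    """Send session.update with a stricter server-side VAD threshold."""
    await openai_ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.9,            # raised from the 0.5 default
                "prefix_padding_ms": 300,
                "silence_duration_ms": 500,
            },
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
        },
    }))
```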
You have to implement your own Acoustic Echo Cancellation (AEC) and noise suppression (ANS) before streaming the audio to the OpenAI Realtime API.
Hey, thanks for replying. Do you have any resources or examples on how to implement AEC and ANS? I tried using the Python noisereduce library on each audio delta, but I don't think I'm doing this right.
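For reference, here's roughly what my current attempt looks like (decode the g711_ulaw delta to PCM, denoise, re-encode). I suspect denoising each tiny chunk in isolation is part of the problem, since noisereduce works much better on longer stretches of audio:

```python
import base64

import audioop      # stdlib µ-law codec; deprecated since Python 3.11
import numpy as np
import noisereduce as nr

SAMPLE_RATE = 8000  # Twilio media streams are 8 kHz mono g711_ulaw

def denoise_ulaw_delta(b64_payload: str) -> str:
    """Denoise one base64-encoded g711_ulaw chunk from Twilio."""
    ulaw = base64.b64decode(b64_payload)
    pcm16 = audioop.ulaw2lin(ulaw, 2)                    # µ-law -> 16-bit PCM
    samples = np.frombuffer(pcm16, np.int16).astype(np.float32) / 32768.0
    # n_fft shrunk so the STFT fits a ~160-sample chunk; buffering
    # 0.5-1 s of audio before denoising gives far better results.
    cleaned = nr.reduce_noise(y=samples, sr=SAMPLE_RATE, n_fft=128)
    pcm_out = (np.clip(cleaned, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    return base64.b64encode(audioop.lin2ulaw(pcm_out, 2)).decode()
```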
In Python, there are probably a ton of resources on how to do this. As I mainly use Java, I won't be of much help, but you can always search the internet or ask ChatGPT!
Well, I use a SIP SDK/library to route the OpenAI Realtime API audio over SIP to the user on an actual phone. It's called JVoIP, though I doubt that's what you're looking for.
JVoIP has AEC and noise suppression built in.
There are a bunch of SDKs out there, though.
Hey shaumikm, here’s a quick-and-dirty recipe to quiet down those unwanted background noises and that pesky self-feedback:
Mic Mute While Speaking: When your AI is dishing out responses, have the mic temporarily mute so it doesn’t catch its own voice. It’s like giving your bot a moment of silence—no echo party here!
Wake Word Detection: Instead of always listening, set it to only actively process audio when a specific wake word is detected. This way, ambient noise is less likely to trigger false activations (rough sketch after this list).
Preprocess with AEC/ANS: Before streaming to the API, run the audio through some filters. Use something like WebRTC’s built-in Acoustic Echo Cancellation (AEC) and noise suppression, or try libraries like RNNoise or SpeexDSP. These tools can significantly clean up both echo and background noise.
Fine-Tune Your VAD: Adjust your Voice Activity Detection thresholds to ignore brief bursts of background chatter. A few tweaks here and there can mean the difference between capturing clear speech and mistaking a random cough for input.
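For the wake-word idea, Picovoice's Porcupine (`pvporcupine`) is one off-the-shelf option. A rough, untested sketch, assuming Twilio's 8 kHz g711_ulaw input (Porcupine wants 16 kHz mono int16 frames, so the audio is resampled first; the access key and keyword are placeholders):

```python
import struct

import audioop       # stdlib µ-law codec + resampler (deprecated in 3.11+)
import pvporcupine

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_KEY",   # placeholder; free tier available
    keywords=["computer"],             # one of the built-in keywords
)

listening = False      # forward audio to the Realtime API only while True
ratecv_state = None    # resampler state carried across chunks
frame_buf = b""

def on_twilio_media(ulaw_bytes: bytes) -> bool:
    """Feed one media chunk; return whether to forward audio upstream."""
    global listening, ratecv_state, frame_buf
    pcm8k = audioop.ulaw2lin(ulaw_bytes, 2)              # µ-law -> PCM16
    pcm16k, ratecv_state = audioop.ratecv(pcm8k, 2, 1, 8000, 16000, ratecv_state)
    frame_buf += pcm16k
    frame_bytes = porcupine.frame_length * 2             # bytes per frame
    while len(frame_buf) >= frame_bytes:
        frame = struct.unpack(f"<{porcupine.frame_length}h", frame_buf[:frame_bytes])
        frame_buf = frame_buf[frame_bytes:]
        if porcupine.process(frame) >= 0:                # wake word heard
            listening = True
    return listening
```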
I love your first option, mic mute while the bot is speaking, but how can we implement it on a regular phone call? Remember, the user has no software installed; they're speaking with the bot over a regular phone call. How can we force the mic to mute while the bot is speaking? I know how to do it in our app, but not when the Realtime API is used over the phone. Thanks!
On the AI serving side, you can have it stop taking inputs while it chats, assuming you control that side, haha. If you're running VAD etc. on the back end, you can simply keep an is_speaking flag and drop inputs until the response finishes (see the sketch below).
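A minimal sketch of that server-side gate, assuming a websockets bridge between Twilio Media Streams and the Realtime API (the event names are the documented ones; the handler wiring is up to you):

```python
import json

is_speaking = False

async def handle_openai_event(event: dict) -> None:
    """Track when the assistant starts and stops producing audio."""
    global is_speaking
    if event["type"] == "response.audio.delta":
        is_speaking = True
    elif event["type"] == "response.done":
        is_speaking = False

async def handle_twilio_message(message: str, openai_ws) -> None:
    """Forward caller audio only while the assistant is silent."""
    msg = json.loads(message)
    if msg.get("event") == "media" and not is_speaking:
        await openai_ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": msg["media"]["payload"],   # base64 g711_ulaw from Twilio
        }))
```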