Background Noise Interfering with Realtime API Using Phone

I’m building a phone bot using Twilio and OpenAI’s Realtime API that allows users to call and speak with an AI assistant about store information. I’m running into some challenges with audio processing and would appreciate any guidance from the community.

Current Implementation

  • Using Twilio for phone integration
  • OpenAI Realtime API with GPT-4
  • Server-side VAD for turn detection
  • g711_ulaw audio format
  • Currently implementing voice activity detection with configurable threshold

Issues

  1. Background Noise Sensitivity: The system is picking up ambient voices and background noise, even with high VAD thresholds
  2. Self-feedback Loop: The assistant sometimes picks up its own voice output and responds to it, particularly when users are on speakerphone
  3. Speakerphone Compatibility: These issues are especially pronounced during speakerphone usage, which is a crucial use case for our implementation

What We’ve Tried

  • Increased VAD threshold settings (up to 0.9) in the session configuration:
    turn_detection: {
        type: 'server_vad',
        threshold: 0.9,
        silence_duration_ms: 500,
    }
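For reference, here is the fuller set of `server_vad` options the session configuration accepts, shown as the Python dict you would send in a `session.update` message (the values besides `threshold` and `silence_duration_ms` are illustrative defaults, not recommendations):

```python
# Hedged sketch: a fuller server_vad configuration for session.update.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.9,             # 0.0-1.0; higher = less sensitive
            "prefix_padding_ms": 300,     # audio kept from before detected speech
            "silence_duration_ms": 500,   # silence required to end a turn
        },
        "input_audio_format": "g711_ulaw",
        "output_audio_format": "g711_ulaw",
    },
}
```

Note that `threshold` only changes how loud audio must be to count as speech; it cannot distinguish the caller's voice from other voices in the room, which is why raising it alone has not solved the problem.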

Questions

  1. Are there recommended approaches for handling background noise in phone bot implementations?
  2. Should we implement audio preprocessing/filtering before sending the audio to the API? If so, what methods would you recommend?
  3. Are there any specific best practices for handling speakerphone scenarios?
  4. Are there additional configuration options we should be exploring beyond VAD threshold adjustment?

Any insights or recommendations would be greatly appreciated. Thank you!

Technical Details

  • Audio Format: g711_ulaw
  • API Version: realtime-preview-2024-12-17
  • Implementation: Node.js with WebSocket

You have to implement your own Acoustic Echo Cancellation (AEC) and noise suppression (ANS) before streaming the audio to the OpenAI Realtime API.

Cheers! :hugs:

Hey, thanks for replying. Do you have any resources or examples on how to implement AEC and ANS? I tried using the Python noisereduce library on each audio delta, but I don’t think I’m doing it right.
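One likely pitfall here (a guess at the failure mode, not a confirmed diagnosis): noisereduce operates on linear PCM samples, while the Realtime API’s `g711_ulaw` deltas are 8-bit companded bytes, so they have to be decoded to 16-bit PCM at 8 kHz first, denoised, then re-encoded. A minimal pure-Python sketch of the G.711 µ-law decode/encode around that step (the function names are mine):

```python
BIAS = 0x84
CLIP = 32635

def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a 16-bit linear PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample

def pcm16_to_ulaw(sample: int) -> int:
    """Encode one 16-bit linear PCM sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    sample = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not sample & mask:
        mask >>= 1
        exponent -= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def denoise_delta(ulaw_bytes: bytes) -> bytes:
    """Decode a g711_ulaw audio delta, denoise it, re-encode it."""
    pcm = [ulaw_to_pcm16(b) for b in ulaw_bytes]
    # This is where noisereduce would go, on the *decoded* samples, e.g.:
    #   import noisereduce as nr, numpy as np
    #   arr = np.array(pcm, dtype=np.float32) / 32768.0
    #   pcm = (nr.reduce_noise(y=arr, sr=8000) * 32768).astype(np.int16)
    # Denoising each tiny delta independently tends to work poorly;
    # buffer ~100 ms of audio per denoise call instead.
    return bytes(pcm16_to_ulaw(s) for s in pcm)
```

Even with correct decoding, spectral-gating denoisers like noisereduce only suppress steady background noise; they will not remove competing voices or the bot’s own echo, which is why an actual AEC stage is still needed.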

In Python, there are probably a ton of resources on how to do this. As I mainly use Java I won’t be of much help, but you can always search the internet or ask ChatGPT! :hugs:

@shaumikm Did you ever figure this out?

How do you recommend implementing that in Java? We are using Java as well and have a similar problem.

Well, I use a SIP SDK/library to route the OpenAI Realtime API audio over SIP to the user on an actual phone. It’s called JVoIP, but I doubt that’s what you’re looking for.
JVoIP has AEC and noise suppression built in. :hugs:
There are a bunch of SDKs out there, though.

Hey shaumikm, here’s a quick-and-dirty recipe to quiet down those unwanted background noises and that pesky self-feedback:

  1. Mic Mute While Speaking: When your AI is dishing out responses, have the mic temporarily mute so it doesn’t catch its own voice. It’s like giving your bot a moment of silence—no echo party here!
  2. Wake Word Detection: Instead of always listening, set it to only actively process audio when a specific wake word is detected. This way, ambient noise is less likely to trigger false activations.
  3. Preprocess with AEC/ANS: Before streaming to the API, run the audio through some filters. Use something like WebRTC’s built-in Acoustic Echo Cancellation (AEC) and noise suppression, or try libraries like RNNoise or SpeexDSP. These tools can significantly clean up both echo and background noise.
  4. Fine-Tune Your VAD: Adjust your Voice Activity Detection thresholds to ignore brief bursts of background chatter. A few tweaks here and there can mean the difference between capturing clear speech and mistaking a random cough for input.

I love your first option, mic mute while the bot is speaking. But how can we implement it on a regular phone call? Remember, the user has no software installed; he is speaking with the bot via a regular phone call. How can we force the mic to mute while the bot is speaking? I know how to do it in our app, but not when the Realtime API is used over a phone call. Tnx

On the AI serving side you can have it stop taking inputs while it chats, assuming you control that side haha. If you are running VAD etc. on the back end, you can simply create a flag to track is_speaking and drop inputs until the response finishes.
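The is_speaking flag described above can be a tiny server-side gate like this sketch (the `response.audio.delta` and `response.done` event types follow the Realtime API; the class and method names are mine): set the flag when the model starts streaming audio, clear it when the response finishes, and drop inbound Twilio frames in between.

```python
class SpeakingGate:
    """Drop caller audio while the assistant's response is streaming out."""
    def __init__(self):
        self.is_speaking = False

    def on_openai_event(self, event: dict):
        if event.get("type") == "response.audio.delta":
            # The assistant has started streaming audio to the caller.
            self.is_speaking = True
        elif event.get("type") == "response.done":
            # The response has finished generating.
            self.is_speaking = False

    def on_twilio_media(self, ulaw_payload: bytes):
        """Return the frame to forward to the API, or None to drop it."""
        return None if self.is_speaking else ulaw_payload
```

One wrinkle to be aware of: Twilio buffers outbound audio, so `response.done` can fire well before the caller has stopped hearing the bot. In practice you would hold the gate closed until Twilio confirms playback has finished, for example by sending a mark message after the last audio chunk and waiting for its echo on the stream.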