Performance of VAD when audio contains background noise or music

I’m considering implementing support for the realtime API in my translation app, but my main concern is how the VAD performs when there is ambient noise or background music.

Implementing the realtime API would be a lot of work, and it wouldn’t be worth it if the VAD has any major issues with music or noise. On the other hand, if it does work well, it would be a huge upgrade for my app.

Does anyone have experience with the VAD capability of the realtime API? I would imagine there are ways to preprocess the audio stream, perhaps with an equalizer, that help the accuracy of voice detection.
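For the record, here is the kind of preprocessing I had in mind, just a band-pass filter over the mic stream before it goes to the API. This is an untested sketch that assumes 24 kHz, 16-bit mono PCM chunks (the pcm16 format the realtime API accepts); I don’t know yet whether it actually improves the VAD:

```python
# Sketch: band-pass the mic stream around typical speech frequencies
# before sending it to the realtime API.
import numpy as np
from scipy.signal import butter, sosfilt

SAMPLE_RATE = 24_000  # Hz, matches the realtime API's pcm16 input format

# 4th-order Butterworth band-pass roughly covering the speech band.
_sos = butter(4, [300, 3400], btype="bandpass", fs=SAMPLE_RATE, output="sos")

def preprocess_chunk(pcm_bytes: bytes) -> bytes:
    """Filter one chunk of raw 16-bit mono PCM and return filtered PCM."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    filtered = sosfilt(_sos, samples)
    # Note: for real streaming you would carry filter state across chunks
    # (scipy's sosfilt zi/zf arguments) instead of filtering each chunk cold.
    return np.clip(filtered, -32768, 32767).astype(np.int16).tobytes()
```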

Hi there!

I was actually curious about that too, so I spent a few minutes setting up their realtime demo to try out their VAD.

Turns out it’s really good. Not only did it not trigger when I coughed, clapped, or burped, but it also worked perfectly with constant background noise (a TV and typing on my mechanical keyboard).

Two years ago, before they released any of the actual voice features, I wanted to create a voice assistant, and I remember using the Silero web VAD, which was just as good. I’m mentioning this because, as you may know, you don’t have to use OpenAI’s VAD if you don’t want to: you can use your own and just use their realtime API for audio in / audio out.
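If you go that route, the flow is roughly: disable server-side turn detection in the session, append audio yourself, and commit the buffer whenever your own VAD decides the turn ended. Here is a rough, untested Python sketch using the websocket-client package and the event names from the realtime API docs; how you segment utterances is entirely up to your own VAD:

```python
# Sketch: drive the realtime API with your own VAD by turning off
# server-side turn detection and committing the audio buffer yourself.
import base64
import json
import os
import websocket  # pip install websocket-client

url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
ws = websocket.create_connection(url, header=[
    f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta: realtime=v1",
])

# Disable OpenAI's VAD so the server never auto-detects turns.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"turn_detection": None},
}))

def send_utterance(pcm_chunks: list[bytes]) -> None:
    """Send one utterance (as segmented by your own VAD) and request a response."""
    for chunk in pcm_chunks:
        ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    ws.send(json.dumps({"type": "response.create"}))
```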
