Realtime API starts to answer itself with mic+speaker setup

I’m using the Realtime API with Python on a Raspberry Pi 4. When I say something and the API responds, the mic picks up the API’s own voice as new input, and it goes into a chaotic loop of babbling with itself.

Are there any options to reduce this? I have tried changing the turn_detection settings like:

turn_detection=ServerVAD(type="server_vad", threshold=0.5, prefix_padding_ms=200, silence_duration_ms=200),

and

turn_detection=ServerVAD(type="server_vad", threshold=0.8, prefix_padding_ms=1000, silence_duration_ms=2000),

But these options don’t seem to have much effect.

Could it be that the Raspberry Pi is somewhat slow and delays playback enough that the API doesn’t recognize the audio as its own speech?


Your search terms:

“pulseaudio echo cancellation module Raspberry Pi”

:blush:


I had a similar issue on iOS. There are a few libraries we can use on iOS, and I found that one of them works well.

You can find my code on GitHub under fuwei007/OpenAIIOSRealtimeAPIDemo

Echo cancellation seems to be the way, but it is hard to enable.

I have not been able to get it to cancel the speaker sound properly yet. Does anyone else have a Raspberry Pi happily using the Realtime API with mic+speakers, and could you share the configs? :slight_smile:

Something I’ve tried for activating the echo cancelling module:
in /etc/pulse/default.pa:

load-module module-echo-cancel rate=11025 aec_method=webrtc source_name=aec_source source_properties=device.description=aec_source sink_name=aec_sink sink_properties=device.description=aec_sink
set-default-source aec_source
set-default-sink aec_sink

I also set the default sample rate in /etc/pulse/daemon.conf to 11025, as my sound card seems to work better with this. 24000 gave invalid sample rate errors.

default-sample-rate = 11025

I resample the Realtime API input and output audio 11.025 kHz -> 24 kHz -> 11.025 kHz with scipy.signal.resample.
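For anyone trying the same thing, here is a minimal sketch of that resampling step. It assumes mono 16-bit little-endian PCM chunks and swaps in scipy.signal.resample_poly (a polyphase FIR resampler) instead of scipy.signal.resample, which is FFT-based and treats each chunk as periodic. The names convert_rate, DEVICE_RATE and API_RATE are made up for illustration, and I have not tested this on a Pi.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

DEVICE_RATE = 11025  # rate the sound card accepts (taken from the post above)
API_RATE = 24000     # rate used for the Realtime API audio in this thread


def convert_rate(pcm16: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample a mono 16-bit little-endian PCM chunk from src_rate to dst_rate."""
    samples = np.frombuffer(pcm16, dtype="<i2").astype(np.float32)
    # Reduce the ratio so the polyphase filter stays small (24000/11025 = 320/147).
    g = gcd(src_rate, dst_rate)
    resampled = resample_poly(samples, dst_rate // g, src_rate // g)
    # Round and clip before casting so filter overshoot cannot wrap around.
    return np.clip(np.rint(resampled), -32768, 32767).astype("<i2").tobytes()
```

Mic chunks would go through convert_rate(chunk, DEVICE_RATE, API_RATE) before being sent, and API audio through convert_rate(chunk, API_RATE, DEVICE_RATE) before playback.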

So far no luck with these settings though. Realtime just starts to answer itself :frowning:

Perhaps you can mute the mic or substitute a 7FFFh (7777h if doing bytes crudely) sample stream while the audio is being played, if you want to discard the interruption ability and hear what you paid for?
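A rough sketch of that mic-muting idea, under some assumptions: audio is mono signed 16-bit PCM, so an all-zero buffer is used as the silent stream here (any constant value carries no sound), and playback is assumed to start as soon as a chunk is queued. PlaybackGate, on_playback, filter_mic and tail_seconds are hypothetical names, not part of any SDK.

```python
import time


class PlaybackGate:
    """Replaces mic audio with silence while assistant audio is playing."""

    def __init__(self, tail_seconds: float = 0.3, clock=time.monotonic):
        self.tail_seconds = tail_seconds  # extra mute time for room echo and buffering
        self._clock = clock
        self._playback_end = float("-inf")  # nothing has been played yet

    def on_playback(self, num_bytes: int, sample_rate: int = 24000) -> None:
        """Call whenever a chunk of assistant audio is queued to the speaker."""
        duration = num_bytes / (2 * sample_rate)  # 2 bytes per mono 16-bit sample
        # Chunks may arrive faster than real time, so extend the estimated end.
        self._playback_end = max(self._playback_end, self._clock()) + duration

    def filter_mic(self, chunk: bytes) -> bytes:
        """Return the chunk unchanged, or zeros of equal length while muted."""
        if self._clock() < self._playback_end + self.tail_seconds:
            return bytes(len(chunk))
        return chunk
```

Every mic chunk would pass through filter_mic before being appended to the input buffer. The tail value is a guess and probably needs tuning per device, since the Pi's output buffering adds delay.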

Trying to interrupt using OpenAI’s VAD input often triggers multiple response creates, because you are confused about why it doesn’t pause and then you pause yourself.

For responsiveness, you might use your own local “interruption” VAD with the audio bits shifted down 6 or 12 dB. A little indicator light showing voice activity at 60 fps is kind of cool to watch. As part of setup, the user can play a prerecorded AI voice loop and adjust the input level control until the indicator is almost off, or some auto-learning of that level would be possible.
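A minimal sketch of such a local level check, assuming mono signed 16-bit PCM chunks. Shifting each sample right by one bit is roughly -6 dB and by two bits roughly -12 dB. The function names and the threshold of 500 are placeholders to be tuned during the setup step, not recommended values.

```python
import array
import math
import sys


def attenuate(pcm16: bytes, shift_bits: int) -> array.array:
    """Shift 16-bit samples right: 1 bit is about -6 dB, 2 bits about -12 dB."""
    samples = array.array("h")
    samples.frombytes(pcm16)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed to be little-endian
    return array.array("h", (s >> shift_bits for s in samples))


def rms(samples) -> float:
    """Root-mean-square level of a sequence of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_voice(pcm16: bytes, shift_bits: int = 1, threshold: float = 500.0) -> bool:
    """True if the attenuated chunk is still louder than the threshold."""
    return rms(attenuate(pcm16, shift_bits)) >= threshold
```

The same rms value could drive the indicator light, and the calibration loop would amount to raising shift_bits or the threshold until is_voice stays False while the AI voice loop plays.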

(Amazon hardware echo cancellation is pretty amazing, BTW, hearing “Alexa” over music blasting from non-device speakers)