Realtime API starts to answer itself with mic+speaker setup

I’m using the Realtime API with Python on a Raspberry Pi 4. When I say something and the API responds, it picks up its own voice as new input and goes into a chaotic loop of babbling with itself.

Are there any options to reduce this? I have tried changing the turn_detection settings like:

turn_detection=ServerVAD(type="server_vad", threshold=0.5, prefix_padding_ms=200, silence_duration_ms=200),

and

turn_detection=ServerVAD(type="server_vad", threshold=0.8, prefix_padding_ms=1000, silence_duration_ms=2000),

But these options don’t seem to have much effect.

Could it be that the Raspberry Pi is somewhat slower and delays playback enough that the API doesn’t recognize the audio as its own speech?

3 Likes

Your search terms:

“pulseaudio echo cancellation module Raspberry Pi”

:blush:

3 Likes

I had a similar issue on iOS. There are a few libraries we can use on iOS, and I found that one of them works well.

You can find my code in the GitHub repo fuwei007/OpenAIIOSRealtimeAPIDemo

1 Like

Echo cancellation seems to be the way to go, but it is hard to enable.

I have not been able to get it to cancel the speaker sound properly yet. Does anyone else have a Raspberry Pi happily using the Realtime API with mic+speakers who could share their configs? :slight_smile:

Something I’ve tried for activating the echo cancelling module:
in /etc/pulse/default.pa:

load-module module-echo-cancel rate=11025 aec_method=webrtc source_name=aec_source source_properties=device.description=aec_source sink_name=aec_sink sink_properties=device.description=aec_sink
set-default-source aec_source
set-default-sink aec_sink

I also set the default sample rate in /etc/pulse/daemon.conf to 11025, as my sound card seems to work better with this. 24000 gave invalid sample rate errors.

default-sample-rate = 11025

I resample the Realtime API input and output audio 11.025 kHz -> 24 kHz -> 11.025 kHz with scipy.signal.resample.
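For readers who want to see the shape of that rate conversion, here is a minimal sketch. It is not the poster's code: it swaps scipy.signal.resample for plain linear interpolation so it runs with the standard library only, and it assumes mono, 16-bit little-endian PCM. The function name `resample_pcm16` is made up for illustration.

```python
import array
import sys


def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation.

    Illustrative only: there is no anti-aliasing filter, so a proper
    resampler will sound better, especially when downsampling.
    """
    samples = array.array("h")
    samples.frombytes(data)
    if sys.byteorder == "big":
        samples.byteswap()  # the input bytes are little-endian

    n_in = len(samples)
    if n_in == 0 or src_rate == dst_rate:
        return data

    n_out = max(1, round(n_in * dst_rate / src_rate))
    out = array.array("h", [0]) * n_out
    step = src_rate / dst_rate  # input samples advanced per output sample

    for i in range(n_out):
        pos = i * step
        left = min(int(pos), n_in - 1)
        right = min(left + 1, n_in - 1)
        frac = pos - left
        out[i] = int(round(samples[left] * (1 - frac) + samples[right] * frac))

    if sys.byteorder == "big":
        out.byteswap()  # convert back to little-endian bytes
    return out.tobytes()


# Mic audio going to the API: 11025 Hz -> 24000 Hz
# api_chunk = resample_pcm16(mic_chunk, 11025, 24000)
# API audio going to the speaker: 24000 Hz -> 11025 Hz
# speaker_chunk = resample_pcm16(api_chunk_out, 24000, 11025)
```

Resampling chunk by chunk like this can leave small discontinuities at chunk boundaries, which may be one more thing that makes life harder for an echo canceller.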

So far no luck with these settings though. Realtime just starts to answer itself :frowning:

Perhaps you can mute the mic or substitute a 7FFFh (7777h if doing bytes crudely) sample stream while the audio is being played, if you want to discard the interruption ability and hear what you paid for?
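A minimal sketch of the "substitute a silent stream" idea, assuming the mic audio is signed 16-bit PCM, where zero-valued samples are digital silence (use a different fill value if your format differs). The function name `gate_chunk` and the 1-second hold time are made up for illustration.

```python
HOLD_SECONDS = 1.0  # assumed tail time after playback; tune for your room and speaker


def gate_chunk(chunk: bytes, last_playback: float, now: float,
               hold_s: float = HOLD_SECONDS) -> bytes:
    """Return silence of the same length while the AI is (or was just) talking.

    `last_playback` and `now` are timestamps from the same clock,
    for example time.monotonic().
    """
    if now - last_playback < hold_s:
        return bytes(len(chunk))  # all-zero bytes: silence for signed 16-bit PCM
    return chunk


# In the mic loop, roughly:
# chunk = gate_chunk(chunk, last_playback, time.monotonic())
```

Sending silence instead of dropping chunks keeps the input stream continuous, so the server-side VAD still sees time passing; the trade-off is that you cannot interrupt the AI while it speaks.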

Trying to interrupt via OpenAI’s VAD input often triggers multiple response creates, because you are confused about why it doesn’t pause, and then you pause yourself.

For responsiveness, you might use your own local “interruption” VAD with the audio attenuated by 6 or 12 dB (shifted down one or two bits). A little voice-activity indicator light updating at 60 fps is kind of cool to watch. As part of setup, the user can play a prerecorded AI voice loop and adjust the input level control until the indicator is almost off; some auto-learning of that level would also be possible.
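A rough sketch of such a local level detector, assuming mono, 16-bit little-endian PCM chunks with an even number of bytes. Shifting each sample right by one bit halves the amplitude (about 6 dB), and by two bits quarters it (about 12 dB). The names `is_voice` and `rms` and the threshold of 500 are placeholders to be calibrated, not values from any library.

```python
import array
import math
import sys


def rms(samples) -> float:
    """Root-mean-square level of a sequence of integer samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_voice(chunk: bytes, shift_bits: int = 1, threshold: float = 500.0) -> bool:
    """Crude energy-based voice detector on attenuated input.

    shift_bits=1 is roughly -6 dB, shift_bits=2 roughly -12 dB.
    """
    samples = array.array("h")
    samples.frombytes(chunk)
    if sys.byteorder == "big":
        samples.byteswap()  # the input bytes are little-endian
    attenuated = [s >> shift_bits for s in samples]
    return rms(attenuated) > threshold


# During setup, play an AI voice loop through the speaker and raise
# shift_bits or threshold until is_voice() is almost always False.
```

Calling `is_voice` on every mic chunk gives you the value to drive the indicator light, and a run of `True` results while the AI is speaking could be treated as a user interruption.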

(Amazon hardware echo cancellation is pretty amazing, BTW, hearing “Alexa” over music blasting from non-device speakers)

1 Like

Yes, that was what I ended up doing to get a proper conversation to work with the RPi. I’m just muting the input mic stream if there is audio in output stream. Did it sort of like this:

import base64
import time

import pyaudio

# "Global" timestamp of when the AI last produced audio
ai_last_talk_time = 0

...

# Output audio handling
def play_audio(output_stream: pyaudio.Stream):
    global ai_last_talk_time
    while True:
        audio_data = audio_output_queue.get()

        # Mark that AI is talking, do not pass input audio to audio_input_queue to avoid echo
        ai_last_talk_time = time.time()
        
        output_stream.write(audio_data)

# Input mic audio handling
def listen_audio(input_stream: pyaudio.Stream):
    global ai_last_talk_time
    while True:
        audio_data = input_stream.read(INPUT_CHUNK_SIZE, exception_on_overflow=False)
        if audio_data is None:
            continue

        # Skip mic input until at least 1 second has passed since the AI last talked, to avoid echo
        if time.time() - ai_last_talk_time < 1:
            print("AI is talking, skipping input audio")
            continue

        base64_audio = base64.b64encode(audio_data).decode("utf-8")
        audio_input_queue.put(base64_audio)

...
1 Like

Hi Tommi, I have come across this thread multiple times while trying to implement AEC for my Realtime API application on my Raspberry Pi 5. Did you ever find a PulseAudio AEC setup that works with this API? I am running into the same issues you had, and I would LOVE to be able to use my voice assistant without headphones. Let me know!

2 Likes