I’m using the Realtime API with Python on a Raspberry Pi 4. When I say something as input, the API’s spoken response gets picked up by the mic as new input, and it goes into a chaotic loop of babbling with itself.
Are there any options to reduce this? I have tried changing the turn_detection settings, for example:
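(The values below are just examples of what I tried, not a fix; ws stands for whatever websocket connection you send events over.)

import json

session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.8,           # higher = less sensitive, picks up less speaker bleed
            "prefix_padding_ms": 300,
            "silence_duration_ms": 800,
        }
    },
}
# await ws.send(json.dumps(session_update))  # ws: your Realtime API websocket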
Echo cancellation seems to be the way to go, but it’s hard to enable. I have not been able to get it to cancel the speaker sound properly yet. Does anyone else have a Raspberry Pi happily running the Realtime API with mic + speakers, and could you share your configs?
Something I’ve tried for activating the echo-cancelling module in /etc/pulse/default.pa:
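That is, loading the module and pointing the defaults at it, along these lines (the ec_source/ec_sink names are my own; aec_method=webrtc selects the WebRTC canceller):

load-module module-echo-cancel aec_method=webrtc source_name=ec_source sink_name=ec_sink
set-default-source ec_source
set-default-sink ec_sink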
I’m also setting the default sample rate in /etc/pulse/daemon.conf to 11025, as my sound card seems to work better with it; 24000 gave invalid-sample-rate errors:
default-sample-rate = 11025
I resample the Realtime API input and output audio 11,025 Hz -> 24,000 Hz -> 11,025 Hz with scipy.signal.resample.
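A minimal sketch of that conversion, assuming raw 16-bit mono PCM chunks (resample_pcm16 and the chunk variables are just illustrative names):

import numpy as np
from scipy.signal import resample

def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample raw 16-bit mono PCM bytes from src_rate to dst_rate."""
    samples = np.frombuffer(data, dtype=np.int16)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = resample(samples, n_out)  # FFT-based resampling
    return np.clip(np.round(out), -32768, 32767).astype(np.int16).tobytes()

# Mic -> API: up from the card's 11,025 Hz to the 24,000 Hz the API expects
api_chunk = resample_pcm16(mic_chunk, 11025, 24000)
# API -> speaker: back down from 24,000 Hz to 11,025 Hz
speaker_chunk = resample_pcm16(response_chunk, 24000, 11025)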
So far no luck with these settings, though; the Realtime API just starts answering itself.
Perhaps you could mute the mic, or substitute a constant 7FFFh sample stream (7777h if you’re doing it crudely byte by byte) while the audio is being played, if you’re willing to give up the interruption ability and just hear what you paid for?
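A sketch of the substitution, assuming 16-bit little-endian PCM (a constant stream has no AC energy, so the server VAD should read it as silence):

import struct

def constant_chunk(chunk: bytes, value: int = 0x7FFF) -> bytes:
    """Replace a mic chunk with a same-length stream of one constant sample value."""
    # The crude byte-by-byte variant would be: b"\x77" * len(chunk)  (7777h samples)
    n = len(chunk) // 2
    return struct.pack("<%dh" % n, *([value] * n))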
The interruption UX with OpenAI’s server-side VAD often produces multiple response creates: confused about why it doesn’t pause, you pause yourself, and each pause registers as a new turn.
For responsiveness, you could run your own local “interruption” VAD, with the input audio shifted down 6 or 12 dB while playback is active. A little indicator light showing voice activity at 60 fps is kind of cool to watch. As part of setup, the user can play a prerecorded AI-voice loop and adjust the input level control until the indicator is almost off, or some auto-learning of that would be possible.
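A crude sketch of such a local VAD (energy-based; the threshold and margin values are placeholders to tune, and raising the threshold while the speaker is active is equivalent to shifting the input down by that many dB):

import numpy as np

VAD_RMS_THRESHOLD = 500   # placeholder; tune against your mic level
PLAYBACK_MARGIN_DB = 12   # extra margin while the AI is talking (try 6 or 12)

def is_voice(chunk: bytes, speaker_active: bool) -> bool:
    """Energy-based voice-activity check on one 16-bit mono PCM chunk."""
    if not chunk:
        return False
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
    rms = float(np.sqrt(np.mean(samples ** 2)))
    threshold = VAD_RMS_THRESHOLD
    if speaker_active:
        # Equivalent to attenuating the input by PLAYBACK_MARGIN_DB
        threshold *= 10 ** (PLAYBACK_MARGIN_DB / 20)
    return rms > threshold

The same rms value could drive the 60 fps indicator light during setup.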
(Amazon’s hardware echo cancellation is pretty amazing, BTW: hearing “Alexa” over music blasting from non-device speakers.)
Yes, that’s what I ended up doing to get a proper conversation working on the RPi. I’m just muting the input mic stream whenever there is audio in the output stream. I did it roughly like this:
# "Global" indicator when AI spoke last time
ai_last_talk_time = 0
...
# Output audio handling
def play_audio(output_stream: pyaudio.Stream):
global ai_last_talk_time
while True:
audio_data = audio_output_queue.get()
# Mark that AI is talking, do not pass input audio to audio_input_queue to avoid echo
ai_last_talk_time = time.Time()
output_stream.write(audio_data)
# Input mic audio handling
def listen_audio(input_stream: pyaudio.Stream):
global ai_last_talk_time
while True:
audio_data = input_stream.read(INPUT_CHUNK_SIZE, exception_on_overflow=False)
if audio_data is None:
continue
# Check if it's been more than 1 second since AI last talked to avoid echo
if time.time() - ai_last_talk_time < 1:
print("AI is talking, skipping input audio")
continue
base64_audio = base64.b64encode(audio_data).decode("utf-8")
audio_input_queue.put(base64_audio)
...
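For completeness, one way the two loops might be wired up; this is a sketch assuming the device can open 24 kHz mono 16-bit streams directly (on my card I had to resample, as described above):

import threading

pa = pyaudio.PyAudio()
output_stream = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
input_stream = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, input=True,
                       frames_per_buffer=INPUT_CHUNK_SIZE)

threading.Thread(target=play_audio, args=(output_stream,), daemon=True).start()
threading.Thread(target=listen_audio, args=(input_stream,), daemon=True).start()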
Hi Tommi, I have come across this thread multiple times while attempting to implement AEC for my Realtime API application on my RPi 5. I was wondering if you ever found a PulseAudio AEC implementation that works with this API? I’m having the same issues you had getting things working, and I would LOVE to be able to use my voice assistant without headphones. Let me know!