I’m using the gpt-4o-realtime-preview-2024-12-17 model to build a chatbot with both audio and text modalities. The problem is that when it plays a response through the speaker, the bot hears itself, takes that response as input, and replies to what it just said. Then it keeps going in a loop. I’m looking for a solution.
Are you building on the Apple ecosystem by chance? If so, you need to look into setVoiceProcessingEnabled(_:) in the Apple Developer Documentation.
Let me know if you are an Apple dev and I’ll share what I know (I ran into this myself, and it’s quite a pain to solve, especially if you are using the AVCaptureSession API).
Thank you, but I’m not an Apple dev. I’m using Linux.
You need echo cancellation to prevent the output from your speaker from being picked up as input by your microphone. This is an audio processing problem, not anything to do with the Realtime API (it can only work with the audio input it gets).
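To make the idea concrete, here is a toy sketch of what an echo canceller does: it learns how the speaker signal shows up in the mic signal and subtracts that estimate. This is a plain normalized-LMS filter on simulated samples, not what any particular library or browser ships; the class name, tap count, and step size are all made up for illustration. On a real Linux setup you would more likely reach for an existing canceller (PulseAudio has a module-echo-cancel, for example) than write your own.

```python
import random


class NlmsEchoCanceller:
    """Toy normalized-LMS echo canceller working on float samples."""

    def __init__(self, taps=8, step=0.5, eps=1e-8):
        self.weights = [0.0] * taps   # estimated speaker-to-mic echo path
        self.history = [0.0] * taps   # most recent speaker samples, newest first
        self.step = step
        self.eps = eps                # avoids division by zero during silence

    def process(self, speaker_sample, mic_sample):
        """Return the mic sample with the estimated echo removed."""
        self.history = [speaker_sample] + self.history[:-1]
        echo_estimate = sum(w * x for w, x in zip(self.weights, self.history))
        residual = mic_sample - echo_estimate
        power = sum(x * x for x in self.history) + self.eps
        gain = self.step * residual / power
        self.weights = [w + gain * x for w, x in zip(self.weights, self.history)]
        return residual


def demo(num_samples=2000, delay=3, echo_gain=0.5, seed=0):
    """Simulate a mic that hears only a delayed, attenuated copy of the speaker."""
    rng = random.Random(seed)
    speaker = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    mic = [0.0] * delay + [echo_gain * s for s in speaker[:-delay]]
    aec = NlmsEchoCanceller()
    cleaned = [aec.process(s, m) for s, m in zip(speaker, mic)]
    return mic, cleaned


def energy(samples):
    """Sum of squared samples, used here as a crude loudness measure."""
    return sum(x * x for x in samples)


if __name__ == "__main__":
    mic, cleaned = demo()
    print("echo energy before:", energy(mic[-200:]))
    print("echo energy after: ", energy(cleaned[-200:]))
```

In this simulation the mic contains nothing but echo, so the cleaned signal should shrink toward silence once the filter has adapted. A real room adds reverberation, a much longer echo path, and the user's own voice, which is why production cancellers are considerably more involved.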
A quick solution is to use headphones/headset where the audio output is played through the headphones and therefore won’t get picked up by your microphone.
Yes, but using headphones is not practical for my application.
This may be a pathologically stupid answer, and a kludge at that, but can you set a gate that takes the robot’s voice as a sidechain and mutes the mic while the thing is speaking?
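A rough sketch of that sidechain gate, assuming you already have the mic and speaker audio as blocks of float samples in [-1, 1]. The class name, threshold, and hold length are hypothetical values to tune, not settings from any library:

```python
import math


def rms(samples):
    """Root-mean-square level of a block of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


class SidechainGate:
    """Mute mic blocks while the speaker (sidechain) signal is loud."""

    def __init__(self, threshold=0.02, hold_blocks=5):
        self.threshold = threshold      # sidechain RMS above this closes the gate
        self.hold_blocks = hold_blocks  # stay closed this many blocks after it drops
        self._hold_left = 0

    def process(self, mic_block, speaker_block):
        """Return mic_block unchanged, or silence of equal length if gated."""
        if rms(speaker_block) > self.threshold:
            self._hold_left = self.hold_blocks
        elif self._hold_left > 0:
            self._hold_left -= 1
        else:
            return list(mic_block)
        return [0.0] * len(mic_block)
```

The hold is there because the room keeps ringing for a moment after the speaker goes quiet; without it the tail of the bot's voice can slip through the gate.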
Muting the mic is a great suggestion. There are audio transcript delta events on the data channel, so it is feasible to know when the AI is about to start talking.
The BIG downside is that you would lose the ability to interrupt when the AI is doing the wrong thing or waffling. If that’s not required then muting would be an option.
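If interruption is not needed, the event-driven mute could look something like the sketch below. The event type strings are my assumption of what the server sends, so check them against what your session actually emits; the class name and the tail delay are hypothetical. The tail is there because playback on the speaker can lag behind the last event.

```python
import time


class AssistantSpeechMute:
    """Track whether the mic should be muted, based on server event types."""

    # Assumed event names; verify against the events your session emits.
    START_EVENTS = {"response.audio_transcript.delta", "response.audio.delta"}
    END_EVENTS = {"response.done"}

    def __init__(self, tail_seconds=0.5, clock=time.monotonic):
        self.tail_seconds = tail_seconds  # extra mute time to cover playback lag
        self._clock = clock               # injectable so the logic can be tested
        self._speaking = False
        self._unmute_at = 0.0

    def on_event(self, event_type):
        """Feed the `type` field of each event received from the server."""
        if event_type in self.START_EVENTS:
            self._speaking = True
        elif event_type in self.END_EVENTS and self._speaking:
            self._speaking = False
            self._unmute_at = self._clock() + self.tail_seconds

    def is_muted(self):
        """True while the assistant is speaking or the tail has not elapsed."""
        return self._speaking or self._clock() < self._unmute_at

    def filter_mic(self, chunk):
        """Return the chunk to send upstream, or None to drop it."""
        return None if self.is_muted() else chunk
```

You would call on_event from your receive loop and pass every captured mic chunk through filter_mic before sending it, skipping the send whenever it returns None.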
Echo cancellation is still likely to be the proper, robust approach. That’s how the browsers do it.
Hey @louzell - I’m using iOS and have run into this issue before. We use AVAudioEngine rather than AVCaptureSession, but also had to setVoiceProcessingEnabled to true on both the input and output nodes.
We still run into occasional issues though. Can you share what you are doing? Curious to compare notes.
@kisalit This is in Java, but I think you could take the recommendations and extrapolate them to your own context:
Hey @tyler10! I put my notes up here. I didn’t want to get too Apple-specific in this thread since @kisalit is looking for a Linux solution: Audio notes for OpenAI realtime on Apple platforms
I hope they help.