Hello all, I’m trying to make use of the Realtime voice API; specifically, for my demonstration I want two voice agents to interact with each other in two different roles.
I have tried various things, but I can’t find a pattern that uses VAD to produce a natural conversation between two AI agents: either the second agent never responds, or it starts speaking halfway through the first agent’s turn.
Sounds like you are doing the code equivalent of holding two phones together.
Server VAD is going to be poor for this. It will work much better to skip VAD-driven turns entirely: load the other agent’s audio into the input buffer, and only trigger a response.create after the first agent’s output has finished with response.done.
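A minimal sketch of that turn-taking loop. The event names (response.done, input_audio_buffer.append/commit, response.create) follow the Realtime API’s client/server events, but the routing here is simulated in-memory — in a real setup each agent would be its own websocket connection, and the orchestrator class and its method names are hypothetical:

```python
# Turn-taking without server VAD: wait for agent A's response.done,
# feed A's audio into agent B's input buffer, commit it, and only then
# ask B for a response. Interruptions are impossible by construction.

class TurnTakingOrchestrator:
    def __init__(self, agents):
        self.agents = agents   # e.g. ["agent_a", "agent_b"]
        self.current = 0       # index of whoever holds the floor
        self.log = []          # ordered record of (agent, event, payload)

    def on_event(self, agent, event, audio=None):
        """Handle a server event from `agent`. Only response.done from
        the current speaker matters: it hands the floor to the peer."""
        if event != "response.done" or agent != self.agents[self.current]:
            return  # ignore stray events so agents can't talk over each other
        self.current = (self.current + 1) % len(self.agents)
        listener = self.agents[self.current]
        # In a real client these three would be websocket sends:
        self.log.append((listener, "input_audio_buffer.append", audio))
        self.log.append((listener, "input_audio_buffer.commit", None))
        self.log.append((listener, "response.create", None))

orch = TurnTakingOrchestrator(["agent_a", "agent_b"])
orch.on_event("agent_a", "response.done", audio=b"...a's reply audio...")
orch.on_event("agent_b", "response.done", audio=b"...b's reply audio...")
print([(who, what) for who, what, _ in orch.log])
```

The key design choice is that the floor is passed explicitly by response.done rather than inferred by VAD, so echo from a shared speaker can never trigger a turn.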
Better still would be to make Chat Completions calls with audio messages and the audio modality against an audio-preview model. You can then start text-to-text and watch the conversation degrade into loops at much lower expense.
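A sketch of one turn of that Chat Completions approach. Each agent keeps its own message history, and the peer’s audio output comes back in as an input_audio content part. The model name and request shape below follow the gpt-4o-audio-preview docs as I understand them — treat them as assumptions and check the current API reference before relying on them:

```python
# One audio-in / audio-out turn via Chat Completions instead of Realtime.
# The request body is built and returned; actually sending it (and the
# model name "gpt-4o-audio-preview") is assumed, not verified here.
import base64

def build_turn_request(system_prompt, history, peer_audio_wav_bytes):
    """Build a Chat Completions request body for one conversational turn.
    `history` is this agent's prior messages; the peer's spoken reply
    arrives as raw WAV bytes and is base64-encoded for the API."""
    messages = (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(peer_audio_wav_bytes).decode(),
                    "format": "wav",
                },
            }],
        }]
    )
    return {
        "model": "gpt-4o-audio-preview",        # assumed model name
        "modalities": ["text", "audio"],        # ask for spoken output too
        "audio": {"voice": "alloy", "format": "wav"},
        "messages": messages,
    }

req = build_turn_request("You are a grumpy barista.", [], b"RIFF....wav-bytes")
print(req["modalities"])  # -> ['text', 'audio']
```

Because each call is a discrete request, there is no audio stream to stomp on: you alternate agents in a plain loop, appending each response to both histories.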
Curious where this went for you. I agree this is a doozy of a challenge. I’d try something like Pipecat’s smart turn model (base_smart_turn — pipecat-ai documentation), which uses ML for turn detection. They also have a wrapper for the Realtime model (OpenAI Realtime - Pipecat). But it might take a more robust setup like WebRTC, and with both agents coming out of the same speaker and client audio I’d imagine there’s no quick fix. Maybe simulate multiple clients somehow. IDK, might give this a try too.
I’ve done this before, but with two devices. It’s super fun, especially if you mix models and give them each a different personality. With a single device, the hard part is getting the audio streams to not stomp on each other, but on two devices it’s easy. I use Raspberry Pi devices, so it’s super cheap, but hardware is not an option for most people.
I’m guessing you could get it to work well on a PC or Mac if you can separate the audio streams in hardware. Attach two Bluetooth speakerphones (so an agent doesn’t hear itself talk), and assign one to each agent.