How to Manage a Multi-Agent Conversation with Different Voices Using Realtime API

Hi,
I’m working on a multi-agent conversational system using the OpenAI Realtime API in Unity (websocket). My goal is to have a user talk to two distinct agents, each with their own personality and voice.
My requirement is that each agent should respond in a different voice.
Do I need to create a separate Realtime session for each agent to assign them different voices?
If I open two sessions (one per agent), is that the correct and scalable approach?
Is there a best practice for using a judge agent (using chat/completions) to decide who speaks next and then trigger that agent’s session with response.create?

Thank you

The assistant voice cannot be changed at any point during a session. The model must have high confidence in how to respond; therefore you are not allowed to supply your own assistant audio as input, and on Chat Completions you must instead reuse an audio ID for prior assistant turns.
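Since the voice is fixed for the lifetime of a session, each agent needs its own session configured up front. A minimal sketch of the two per-agent `session.update` events, assuming the Realtime WebSocket event shape (the voice names and persona instructions here are illustrative):

```python
import json

def session_update(voice: str, instructions: str) -> str:
    """Build the session.update event that pins an agent's voice.

    The voice cannot be changed once the session has produced audio,
    so one session per agent is required for distinct voices.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,                # fixed for the session lifetime
            "instructions": instructions,  # this agent's persona
            "modalities": ["audio", "text"],
        },
    })

# One event per agent, each destined for its own WebSocket connection
agent_a = session_update("alloy", "You are Agent A, a cheerful guide.")
agent_b = session_update("verse", "You are Agent B, a skeptical critic.")
```

Each JSON string would be sent over that agent's own WebSocket right after the connection opens.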

Realtime only has the facility to receive unlabeled user audio. Therefore, even a pattern where the AI hears different speakers and responds appropriately, or understands one of them to also be a helper, is not a supported pattern. Even providing an audio “history” would be a challenge: any remix of audio exchanges that happened externally will be treated in its entirety as a single user turn to be answered.

Chat Completions gives you a bit more flexibility within the “one voice per session” constraint: you can add multiple “user” turns, and there is even a “name” field that the AI can understand. Still, you would be working against the trained patterns.
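For reference, a sketch of what labeling speakers via the Chat Completions `name` field looks like (the speaker names and message content are illustrative):

```python
def labeled_turn(speaker: str, text: str) -> dict:
    """A user message tagged with the speaker's name via the `name` field."""
    return {"role": "user", "content": text, "name": speaker}

# Two distinct speakers interleaved in one conversation history
messages = [
    {"role": "system", "content": "Two users, Ana and Ben, are talking to you."},
    labeled_turn("Ana", "What should we cook tonight?"),
    labeled_turn("Ben", "Something quick, please."),
]
```

The model can attribute each turn to its named speaker, but, as noted above, this still cuts against the patterns the model was trained on.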

Thank you for the explanation! Do you have any recommendations for building this kind of multi-agent and user interaction given Realtime’s current capabilities?

The only facility you have is that you can place plain text into roles other than just “user”.

However, if you show the assistant role responding with text, it will think that the message is its own, and it may drop out of responding with audio and start responding with text, just as your prior turns demonstrate.

The further you pivot from the trained patterns, the more you will degrade answer quality - or you may not get audio at all. So I can only suggest that you re-think why you need different voices to be heard, or have responses re-spoken with TTS from the transcript in particular cases by your code (which also eliminates client WebRTC).
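The TTS fallback amounts to routing each agent's transcript to a per-agent voice. A sketch of the request body for the `/v1/audio/speech` endpoint, assuming the standard `tts-1` model and voice names (the agent-to-voice mapping is illustrative):

```python
# Hypothetical mapping from agent identity to a TTS voice
AGENT_VOICES = {"AgentA": "nova", "AgentB": "onyx"}

def tts_request(agent: str, transcript: str) -> dict:
    """Body for POST /v1/audio/speech that re-speaks a transcript
    in the voice assigned to this agent."""
    return {
        "model": "tts-1",
        "voice": AGENT_VOICES[agent],
        "input": transcript,
    }
```

Your code would call this whenever a particular agent's text response needs to be heard in that agent's voice rather than the session's locked-in voice.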

Looks like this can be achieved using multiple WebSocket sessions, one per agent.
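That resolution can be sketched as a small router: one WebSocket session per agent, with a judge (e.g. a chat/completions call) choosing who speaks, and your code forwarding the user turn plus a `response.create` to that agent's session. The `AgentSession.send` below stands in for the real socket's send method; the agent names and voices are illustrative:

```python
import json

class AgentSession:
    """One Realtime WebSocket per agent, each with its own fixed voice."""

    def __init__(self, name: str, voice: str):
        self.name, self.voice = name, voice
        self.sent: list[str] = []  # stand-in for the wire; real code: ws.send

    def send(self, event: dict) -> None:
        self.sent.append(json.dumps(event))

sessions = {
    "guide": AgentSession("guide", "alloy"),
    "critic": AgentSession("critic", "verse"),
}

def trigger(next_speaker: str, user_text: str) -> None:
    """Forward the user turn to the judge-selected agent and ask it to respond."""
    ws = sessions[next_speaker]
    ws.send({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        },
    })
    ws.send({"type": "response.create"})  # this agent speaks in its own voice

trigger("critic", "Is this plan realistic?")
```

The cost is one open connection per agent, and your code must keep the per-agent conversation histories consistent, since each session only sees the turns you forward to it.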