How to Manage a Multi-Agent Conversation with Different Voices Using Realtime API

Hi,
I’m working on a multi-agent conversational system using the OpenAI Realtime API in Unity (websocket). My goal is to have a user talk to two distinct agents, each with their own personality and voice.
My requirement is that each agent should respond in a different voice.
Do I need to create a separate Realtime session for each agent to assign them different voices?
If I open two sessions (one per agent), is that the correct and scalable approach?
Is there a best practice for using a judge agent (using chat/completions) to decide who speaks next and then trigger that agent’s session with response.create?

Thank you

The assistant voice cannot be changed at any point during a session. The model must have high confidence in how to respond; therefore you are not allowed to supply your own assistant audio as input, and on Chat Completions you must instead reuse an audio ID for prior assistant turns.
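Since the voice is fixed for the lifetime of a session, each agent needs its own session configured up front. A minimal sketch of the two per-agent `session.update` events, assuming the Realtime WebSocket event shape (the voice names and persona instructions here are illustrative):

```python
import json

def session_update(voice: str, instructions: str) -> str:
    """Build the session.update event that pins an agent's voice.

    The voice cannot be changed once the session has produced audio,
    so one session per agent is required for distinct voices.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,                # fixed for the session lifetime
            "instructions": instructions,  # this agent's persona
            "modalities": ["audio", "text"],
        },
    })

# One event per agent, each destined for its own WebSocket connection
agent_a = session_update("alloy", "You are Agent A, a cheerful guide.")
agent_b = session_update("verse", "You are Agent B, a skeptical critic.")
```

Each JSON string would be sent over that agent's own WebSocket right after the connection opens.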

Realtime only has the facility to receive unlabeled user audio. Therefore, even a pattern where the AI hears different speakers and responds appropriately, or understands one of them to also be a helper, is not a supported pattern. Even providing an audio “history” would be a challenge: any remix of audio exchanges that happened externally will be treated in its entirety as a single user turn to be answered.

Chat Completions gives you a bit more flexibility within the “one voice per session” constraint: you can add multiple “user” turns, and there is even a “name” field that the AI can understand. Still, you would be working against the trained patterns.
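For reference, a sketch of what labeling speakers via the Chat Completions `name` field looks like (the speaker names and message content are illustrative):

```python
def labeled_turn(speaker: str, text: str) -> dict:
    """A user message tagged with the speaker's name via the `name` field."""
    return {"role": "user", "content": text, "name": speaker}

# Two distinct speakers interleaved in one conversation history
messages = [
    {"role": "system", "content": "Two users, Ana and Ben, are talking to you."},
    labeled_turn("Ana", "What should we cook tonight?"),
    labeled_turn("Ben", "Something quick, please."),
]
```

The model can attribute each turn to its named speaker, but, as noted above, this still cuts against the patterns the model was trained on.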

Thank you for the explanation! Do you have any recommendations for building this kind of multi-agent and user interaction given Realtime’s current capabilities?

The only facility you have is that you can place plain text into roles other than just “user”.

However, if you show the assistant role responding with text, it will think that the message is its own, and it may drop out of responding with audio and start responding with text, just as your prior turns demonstrate.

The further you pivot from the trained patterns, the more you will degrade answer quality - or you may not get audio at all. So I can only suggest that you re-think why you need different voices to be heard, or have responses re-spoken with TTS from the transcript in particular cases by your code (which also eliminates client WebRTC).
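The TTS fallback amounts to routing each agent's transcript to a per-agent voice. A sketch of the request body for the `/v1/audio/speech` endpoint, assuming the standard `tts-1` model and voice names (the agent-to-voice mapping is illustrative):

```python
# Hypothetical mapping from agent identity to a TTS voice
AGENT_VOICES = {"AgentA": "nova", "AgentB": "onyx"}

def tts_request(agent: str, transcript: str) -> dict:
    """Body for POST /v1/audio/speech that re-speaks a transcript
    in the voice assigned to this agent."""
    return {
        "model": "tts-1",
        "voice": AGENT_VOICES[agent],
        "input": transcript,
    }
```

Your code would call this whenever a particular agent's text response needs to be heard in that agent's voice rather than the session's locked-in voice.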

Looks like this can be achieved using multiple WebSocket sessions, one per agent.
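That resolution can be sketched as a small router: one WebSocket session per agent, with a judge (e.g. a chat/completions call) choosing who speaks, and your code forwarding the user turn plus a `response.create` to that agent's session. The `AgentSession.send` below stands in for the real socket's send method; the agent names and voices are illustrative:

```python
import json

class AgentSession:
    """One Realtime WebSocket per agent, each with its own fixed voice."""

    def __init__(self, name: str, voice: str):
        self.name, self.voice = name, voice
        self.sent: list[str] = []  # stand-in for the wire; real code: ws.send

    def send(self, event: dict) -> None:
        self.sent.append(json.dumps(event))

sessions = {
    "guide": AgentSession("guide", "alloy"),
    "critic": AgentSession("critic", "verse"),
}

def trigger(next_speaker: str, user_text: str) -> None:
    """Forward the user turn to the judge-selected agent and ask it to respond."""
    ws = sessions[next_speaker]
    ws.send({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        },
    })
    ws.send({"type": "response.create"})  # this agent speaks in its own voice

trigger("critic", "Is this plan realistic?")
```

The cost is one open connection per agent, and your code must keep the per-agent conversation histories consistent, since each session only sees the turns you forward to it.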