Ability to Specify Speaker Name or Source in Realtime API for Group Sessions

Hi OpenAI Team,

I’d like to request a feature for the Realtime API to better support group conversations—specifically, the ability to specify a speaker name (or some kind of speaker identifier) when appending audio to the input buffer.

Use Case:
In multi-user environments (such as live meetings, classroom discussions, or collaborative workshops), it’s common to have multiple human speakers engaging with the assistant. Right now, there’s no built-in way to indicate to the API which person is speaking at any given time. This creates challenges for both the assistant’s understanding and the quality of any generated transcript.

Potential Solutions:

  • Allow us to attach a speaker identifier with each incoming audio buffer.
  • Enable tagging of audio streams or provide support for multiple parallel audio streams (where each stream is mapped to a known participant/microphone).
  • Accept metadata (e.g., speaker: "Alice") alongside streaming audio, so the assistant can correctly attribute each turn of dialogue.

Why this matters:
Distinguishing between speakers would make conversations much more natural and accurate. This is important for applications like collaborative group chats, educational settings, telehealth, and any setting where more than one person is interacting with the API at once.

A concrete example: In a two-mic setup, each mic is assigned to a specific participant. If I could specify speaker="Alice" for mic 1 and speaker="Bob" for mic 2 when submitting audio, the conversation context would be vastly improved.
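
For illustration, here’s a rough sketch of what that could look like on the wire. To be clear, the speaker field below is hypothetical—it’s the feature being requested, not something the API accepts today. The input_audio_buffer.append event and base64 audio payload are the existing parts; ws is assumed to be an already-open Realtime API WebSocket connection (e.g., via websocket-client).

```python
import base64
import json

def append_audio(ws, pcm_chunk: bytes, speaker: str) -> None:
    # Existing event shape, plus the proposed (hypothetical) "speaker" field.
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
        "speaker": speaker,  # proposed: attribute this chunk to a named participant
    }
    ws.send(json.dumps(event))

# Two-mic setup: route each capture device to its participant label.
# append_audio(ws, mic1_chunk, speaker="Alice")
# append_audio(ws, mic2_chunk, speaker="Bob")
```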

Current Workarounds:
I’ve considered workarounds like running diarization externally and prepending speaker names to the transcripts, but this adds latency and complexity (especially in real-time scenarios). Native support would be much more accurate and developer-friendly.
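
For reference, the workaround looks roughly like the sketch below: attribution happens outside the session, and the attributed text is injected as a conversation item instead of audio. Here transcribe() stands in for my own external speech-to-text/diarization step (a placeholder, not an OpenAI call), and ws is an open Realtime API WebSocket.

```python
import json

def inject_attributed_turn(ws, speaker: str, transcript: str) -> None:
    # Inject the externally transcribed, speaker-prefixed text as a user message.
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_text", "text": f"{speaker}: {transcript}"},
            ],
        },
    }
    ws.send(json.dumps(event))

# inject_attributed_turn(ws, "Alice", transcribe(mic1_chunk))
```

This works, but it trades away the low-latency audio path, which is exactly why native speaker metadata would be so valuable.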

Thanks for considering this! Would love to know if this is on the roadmap, or if there’s a recommended workaround I’ve missed.