Bidirectional realtime translation translates a single utterance into multiple languages

TL;DR: I am trying to use bidirectional realtime translation in speakerphone mode, and it’s translating a single input voice into the target languages of both streams, resulting in overlapping voices from the model in different languages.

I am calling the gpt-realtime-translate API from an iOS WebRTC app, and when I speak in English with realtime translation to German enabled, it translates my speech into both English and German, which you can see in the transcript below. Note that this is with a single speaker and in speakerphone mode.


I can’t upload a video here, but here’s a link to the GH issue with a video that should make the issue clear: Bidirectional Realtime Translation · Issue #133 · tleyden/arty · GitHub

Some hypotheses on why it doesn’t work:

  1. Since I have two streams open and the mic audio is being sent on both, the model is translating on both: one stream English → German, and the other English → English (which I find surprising and counter-intuitive).

  2. This is an unsupported configuration: the API expects completely isolated streams, and bidirectional translation doesn’t work when both streams are connected to the same mic and speaker.

I think the issue is number two. The docs say: “one translation session per direction, do not mix both callers”, but I have no choice but to mix the audio streams. I am trying to build an “in real life” translator that can translate in both directions between two different speakers, and there are no separate callers with their own mics and speakers as you would have in an online meeting: I put my phone on the table and have a two-way conversation with another person in two different languages.
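One workaround I’m considering is a half-duplex router on the app side: run per-utterance language identification on the shared mic feed and forward each utterance only to the session whose source language matches, so neither session ever hears the other direction’s input. A minimal sketch (all names here are hypothetical, not part of the gpt-realtime-translate API):

```python
# Half-duplex routing sketch for a single shared mic feeding two
# translation sessions (EN -> DE and DE -> EN). The session dicts and
# route_utterance() are hypothetical app-side structures, not API objects.

def route_utterance(utterance_lang, sessions):
    """Forward an utterance only to the session expecting that source language."""
    return [s for s in sessions if s["source_lang"] == utterance_lang]

sessions = [
    {"name": "en_to_de", "source_lang": "en", "target_lang": "de"},
    {"name": "de_to_en", "source_lang": "de", "target_lang": "en"},
]

# An English utterance goes only to the EN -> DE session, so the DE -> EN
# session never sees (and never re-translates) the English audio.
targets = route_utterance("en", sessions)
```

The hard part, of course, is the language-identification step itself, which would have to run on-device before any audio is forwarded.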

If speakerphone mode is supported for bidirectional translation, is there a way to tell the model which language it should expect to hear on the input stream? Or any other suggested approaches?
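In case it helps frame the question: the standard Realtime API accepts a `session.update` event with free-form `instructions` over the WebRTC data channel. I don’t know whether the translation model honors language constraints expressed this way (that’s exactly what I’m asking), but something like the following is what I have in mind:

```python
import json

# Hedged sketch: builds a standard Realtime API `session.update` event.
# Whether gpt-realtime-translate respects a language constraint in
# `instructions` is an open question, not a documented behavior.
def make_session_update(source_lang, target_lang):
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                f"You will only hear {source_lang} speech on this stream. "
                f"Translate {source_lang} input into {target_lang}. "
                f"If you hear any other language, stay silent."
            )
        },
    })

# One such event per direction, sent on each session's data channel.
event = make_session_update("English", "German")
```

If the model ignored input in the “wrong” language per these instructions, the overlapping-voices problem would go away even with a shared mic.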
