Translation to a chosen language in real-time during a video conference

Hi All,

I am currently working on a project focused on video conferencing. One aspect of the project involves accommodating users who do not understand English by allowing them to listen in real-time to the conference in their chosen language.

How can this be achieved? Which model should I use?

If there isn’t an existing model for this purpose, could I generate text (Speech-to-Text) and then pass this text to a model to translate it directly into the selected language in real-time?

Welcome to the community, @mazen.obeid!

Currently, the transcription model on the API does not support streaming audio input, so a request/response loop may introduce significant lag. There is also a translation endpoint, but it currently only translates to English; it may support other target languages in the future.

For now, you will need to deploy your own instance of Whisper locally for streaming STT.
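A minimal sketch of what that chunked streaming loop around a local Whisper instance might look like. The `transcribe` callable here is a placeholder for your actual call (e.g. `model.transcribe(...)` from the open-source `whisper` package) — the buffering logic is the point, not any specific Whisper API:

```python
# Sketch of a chunked streaming loop around a locally deployed Whisper
# instance. `transcribe` is a placeholder for your actual model call
# (assumption: you load and invoke Whisper yourself; nothing here is an
# official streaming API).

CHUNK_SECONDS = 5      # trade-off: shorter chunks = lower lag, worse accuracy
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono audio

def stream_stt(frames, transcribe, chunk_seconds=CHUNK_SECONDS):
    """Accumulate audio frames into fixed-size chunks and yield transcripts.

    frames      -- iterable of per-frame sample lists (e.g. from a mic callback)
    transcribe  -- callable taking a list of samples, returning text
    """
    buf = []
    target = chunk_seconds * SAMPLE_RATE
    for frame in frames:
        buf.extend(frame)
        while len(buf) >= target:
            yield transcribe(buf[:target])
            buf = buf[target:]
    if buf:  # flush the trailing partial chunk
        yield transcribe(buf)
```

Shorter chunks reduce end-to-end delay but give the model less context, so you would tune `CHUNK_SECONDS` against your latency budget.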

Afterward, you will need to implement streaming translations.
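For the translation stage, quality is usually better if you buffer the incremental STT output until a sentence boundary before sending it to the translation model. A sketch of that buffering, where `translate` is a placeholder for whatever MT backend you choose (a local model, a hosted API, etc. — the callable itself is an assumption, not a specific product):

```python
import re

# Buffer incremental STT output until a sentence boundary, then hand the
# complete sentence to `translate` -- a placeholder for your MT backend.

_SENTENCE_END = re.compile(r'([^.!?]*[.!?])\s*')

def stream_translate(partials, translate):
    """Yield translated sentences as soon as each one is complete."""
    buf = ""
    for text in partials:
        buf += text
        while (m := _SENTENCE_END.match(buf)):
            yield translate(m.group(1).strip())
            buf = buf[m.end():]
    if buf.strip():  # flush any trailing fragment at end of stream
        yield translate(buf.strip())
```

This keeps per-sentence latency low while giving the translator a complete unit to work with, instead of translating mid-sentence fragments that lack context.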

EDIT: Meta has already conducted research on this topic and has functioning models - Seamless Communication.

1 Like

The OpenAI whisper model does not have a real-time mode that’s publicly available.
I have done something similar using other services on the web, and my current choice is AssemblyAI, but there are several others, like Deepgram, Rev, AWS Transcribe, and so on. (Plugging one of these names into Google will typically return paid ads for all the others, because they bid on each other's traffic. And Google cashes the check :-D)

1 Like