I am currently working on a project focused on video conferencing. One aspect of the project involves accommodating users who do not understand English by allowing them to listen to the conference in real time in a language of their choice.
How can this be achieved? Which model should I use?
If there isn’t an existing model for this purpose, could I generate text (speech-to-text) and then pass that text to a model that translates it into the selected language in real time?
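Something like this two-step pipeline is what I have in mind. This is only a rough sketch assuming the current openai Python SDK; the model names and the translation prompt are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def transcribe_then_translate(audio_path: str, target_language: str) -> str:
    # Step 1: speech-to-text on a recorded chunk of conference audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: translate the transcript text with a chat model.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": f"Translate the user's message into {target_language}. "
                           "Reply with the translation only.",
            },
            {"role": "user", "content": transcript.text},
        ],
    )
    return completion.choices[0].message.content


# Called once per audio chunk, so latency depends on chunk length:
# print(transcribe_then_translate("chunk_0001.wav", "Spanish"))
```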
Currently, the transcription model on the API does not support streaming audio responses, which may result in significant lag. There is also a translation endpoint, but it currently only translates to English; it may support other languages in the future.
The OpenAI Whisper model does not have a publicly available real-time mode.
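For reference, here is what the two endpoints mentioned above look like with the current openai Python SDK. This is a sketch, not production code; both calls upload a complete file and wait for the full result, which is where the lag comes from:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Transcription: audio in, text out in the spoken language.
with open("chunk.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# Translation: audio in, English text out (English only, for now).
with open("chunk.wav", "rb") as f:
    english = client.audio.translations.create(model="whisper-1", file=f)

print(transcript.text)
print(english.text)
```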
I have done something similar using other services on the web, and my current choice is AssemblyAI, but there are several others, like Deepgram, Rev, AWS Transcribe, and so on. (Plugging one of these names into Google will typically return paid ads for all the others, because they bid on each other's traffic. And Google cashes the check :-D)
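If it helps, the shape of a streaming integration looks roughly like this. It is a sketch of AssemblyAI's v2 real-time WebSocket API written from memory, so the endpoint URL, headers, and message fields may have changed; verify against their current docs before relying on it:

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

API_KEY = "YOUR_ASSEMBLYAI_KEY"  # placeholder
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"


async def stream(pcm_chunks):
    """pcm_chunks: an async iterator of raw 16 kHz, 16-bit mono PCM bytes."""
    # Note: newer versions of the websockets library rename extra_headers
    # to additional_headers.
    async with websockets.connect(URL, extra_headers={"Authorization": API_KEY}) as ws:

        async def send():
            async for chunk in pcm_chunks:
                payload = {"audio_data": base64.b64encode(chunk).decode()}
                await ws.send(json.dumps(payload))
            await ws.send(json.dumps({"terminate_session": True}))

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                # Partial transcripts arrive continuously; final ones are
                # stable and punctuated, so those are what you'd translate.
                if msg.get("message_type") == "FinalTranscript":
                    print(msg.get("text"))

        await asyncio.gather(send(), receive())
```

The key point is that you get text back while the speaker is still talking, so you can feed finalized fragments straight into a translation step instead of waiting for a whole recording.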