I am trying to use the Realtime API to translate videos. The problem is that the output audio keeps getting interrupted. I think this is due to the conversational-style answering mechanism of the API.
Does anyone know a way around that? At first I thought deactivating VAD would fix my problem, but when I keep committing audio, the API stops outputting audio and starts processing the new audio instead.
One idea would be to open multiple WebSocket connections and combine them with client-side VAD.
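
To make that idea a bit more concrete, here is a rough sketch of the client-side part only. It assumes 16-bit PCM samples already decoded into a list of ints, and it uses a simple RMS energy threshold as a stand-in for a real VAD. The function names, the threshold, and the frame sizes are all made up for illustration, and nothing here talks to the Realtime API itself. Each detected speech segment is assigned to a connection index in round-robin order, so one connection could still be producing audio while the next segment goes to another one.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start_sample, end_sample), end is exclusive


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_on_silence(
    samples: Sequence[int],
    frame_len: int = 480,           # assumed: 20 ms at 24 kHz
    threshold: float = 500.0,       # assumed energy threshold, needs tuning
    min_silence_frames: int = 10,   # silent frames that close a segment
) -> List[Segment]:
    """Very naive energy-based VAD: return the voiced segments."""
    segments: List[Segment] = []
    start = None      # start sample of the segment currently being built
    end = 0           # end sample of the last voiced frame seen
    silent_run = 0    # consecutive silent frames inside a segment

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            end = i + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((start, end))
                start = None
                silent_run = 0

    if start is not None:  # audio ended while still inside a segment
        segments.append((start, end))
    return segments


def assign_round_robin(
    segments: List[Segment], n_connections: int
) -> List[Tuple[int, Segment]]:
    """Pair each segment with a (hypothetical) connection index."""
    return [(k % n_connections, seg) for k, seg in enumerate(segments)]


if __name__ == "__main__":
    # Synthetic audio: silence, speech, silence, speech, silence.
    audio = [0] * 100 + [1000] * 100 + [0] * 100 + [1000] * 50 + [0] * 100
    segs = split_on_silence(audio, frame_len=10, min_silence_frames=5)
    for conn, (s, e) in assign_round_robin(segs, n_connections=2):
        print(f"connection {conn}: samples {s}..{e}")
```

Each connection would then only ever receive one complete utterance at a time, so committing the next segment should not cut off a response that is still being generated elsewhere. I have not tested whether this actually behaves that way against the API, and the translated clips would still have to be put back in order on the client.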