Over the last day or two, combining noise_reduction with either transcription or a turn_detection of semantic_vad has caused a huge amount of latency on the Realtime API. Once the conversation has gone on for a few minutes, it takes up to a full MINUTE from the time the user stops speaking to the time the input_audio_buffer.speech_started event is returned. It’s not present early in the session, but it has happened with 100% consistency the longer the session goes.
Our setup, using WebRTC:
turn_detection: semantic_vad
input_audio_noise_reduction: near_field
input_audio_transcription: whisper-1
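For anyone trying to reproduce this, the settings above correspond roughly to the session.update event below, sent over the WebRTC data channel. This is only a sketch: it assumes the flat beta-style session shape (newer API versions may nest these fields differently), and build_session_update is a hypothetical helper name, not part of any SDK.

```python
import json


def build_session_update(noise_reduction="near_field",
                         turn_detection="semantic_vad",
                         transcription_model="whisper-1"):
    """Build a session.update event for the setup described above.

    Passing None for a setting sends JSON null for that field, which is
    assumed here to switch the feature off.
    """
    session = {
        "turn_detection": (
            {"type": turn_detection} if turn_detection is not None else None
        ),
        "input_audio_noise_reduction": (
            {"type": noise_reduction} if noise_reduction is not None else None
        ),
        "input_audio_transcription": (
            {"model": transcription_model}
            if transcription_model is not None else None
        ),
    }
    return {"type": "session.update", "session": session}


# Serialize the event; the resulting string is what would be sent
# on the data channel.
print(json.dumps(build_session_update(), indent=2))
```

Calling build_session_update(noise_reduction=None) gives the variant without noise reduction, which makes it easy to switch between the combinations described in this post.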
Removing the noise reduction solves the problem, and having ONLY noise reduction also works fine, but adding either a turn_detection of semantic_vad or transcription with any model reintroduces it.
We’ve also tried far_field noise reduction, with the same increase in latency, and swapping the transcription model to gpt-4o-mini-transcribe, which didn’t help either.
It seems to be related to the combination of:
noise reduction (of either type) and transcription with any model
OR
noise reduction (of either type) and turn_detection set to semantic_vad
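Restated as a predicate, the pattern looks like the sketch below. It encodes only the observations in this post, not any documented API behavior; triggers_latency is a hypothetical name, and a turn_detection of None here just stands for "anything other than semantic_vad".

```python
from itertools import product


def triggers_latency(noise_reduction, turn_detection, transcription_model):
    """Return True if this combination matches the slow cases reported above."""
    if noise_reduction is None:
        # Removing noise reduction made the latency go away.
        return False
    uses_semantic_vad = turn_detection == "semantic_vad"
    uses_transcription = transcription_model is not None
    # Noise reduction alone was fine; either extra feature brought it back.
    return uses_semantic_vad or uses_transcription


noise_options = [None, "near_field", "far_field"]
turn_options = [None, "semantic_vad"]  # None means "not semantic_vad"
model_options = [None, "whisper-1", "gpt-4o-mini-transcribe"]

# Print the full matrix of combinations with the reported outcome.
for nr, td, model in product(noise_options, turn_options, model_options):
    status = "slow" if triggers_latency(nr, td, model) else "ok"
    print(f"noise={nr!s:<10} turn={td!s:<12} transcription={model!s:<22} -> {status}")
```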
Anyone else experiencing this?