Over the last day or two, combining noise_reduction with transcription, or with turn_detection set to semantic_vad, has caused a huge amount of latency on the Realtime API. Once the conversation has gone on for a few minutes, it takes up to a full MINUTE from the time the user stops speaking to the time the input_audio_buffer.speech_started event is returned. It’s not present early in the session, but it has happened with 100% consistency the longer the session goes.
Removing the noise reduction solves the problem, and so does having ONLY noise reduction, but adding either turn_detection of semantic_vad or transcription with any model reintroduces it.
We’ve tried far_field noise reduction as well, with the same increase in latency. We’ve also tried swapping the transcription model to gpt-4o-mini-transcribe, which made no difference.
It seems to be related to the combination of noise reduction (of either type) and transcription with any model
OR
noise reduction (of either type) and turn_detection being semantic_vad
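To make the combinations concrete, here is a rough sketch of the session.update payloads involved. It assumes the flat session shape, with input_audio_noise_reduction, input_audio_transcription and turn_detection sitting directly under session, and that sending null switches a feature off. The helper names build_session_update and is_affected are made up for illustration, and is_affected only encodes the pattern observed in this thread, not documented behavior.

```python
from typing import Optional


def build_session_update(noise_reduction: Optional[str],
                         transcription_model: Optional[str],
                         vad_type: Optional[str]) -> dict:
    """Build a session.update event (hypothetical helper).

    Passing None for a field sends null, which is assumed to turn it off.
    """
    session = {
        "input_audio_noise_reduction":
            {"type": noise_reduction} if noise_reduction else None,
        "input_audio_transcription":
            {"model": transcription_model} if transcription_model else None,
        "turn_detection":
            {"type": vad_type} if vad_type else None,
    }
    return {"type": "session.update", "session": session}


def is_affected(noise_reduction: Optional[str],
                transcription_model: Optional[str],
                vad_type: Optional[str]) -> bool:
    """Pattern reported in this thread (an observation, not a documented rule):
    noise reduction of either type, plus transcription OR semantic_vad."""
    return noise_reduction is not None and (
        transcription_model is not None or vad_type == "semantic_vad"
    )
```

For example, build_session_update("far_field", "gpt-4o-mini-transcribe", None) is one of the setups that shows the spike for us, while build_session_update("near_field", None, None) does not.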
In my understanding, both the VAD choice and noise reduction affect the time it takes for the model to respond. However, such a large delay is definitely not helpful.
I tested this with semantic_vad and near_field input_audio_noise_reduction, and was able to alleviate the slow response time by setting the eagerness property of the semantic_vad turn_detection to high.
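For reference, the setting described above would look roughly like this as a session.update event. This is a sketch that assumes the flat session shape and only sets the two fields mentioned here; everything else is left at its default.

```python
import json

# Assumed flat session shape; only the fields discussed in this thread are set.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_noise_reduction": {"type": "near_field"},
        "turn_detection": {"type": "semantic_vad", "eagerness": "high"},
    },
}

# Serialized form, ready to send over an already-open Realtime WebSocket.
payload = json.dumps(session_update)
```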
Thanks for the reply. I should have mentioned that we do have the eagerness set to high. The beginning of each session is normal; it’s once a session reaches around 5 minutes that latency spikes. This issue is new: we’ve had this setup for weeks without the late-in-session spike.
Thanks for confirming. It would help the investigation if you could share event logs for the delayed responses, especially the input_audio_buffer, conversation_item and response.created events.
Once you start a session, the delay between the time the user speaks and the input_audio_buffer.speech_started event starts to spike around 5 minutes in, which is exactly what we are seeing with our implementation. Here are logs with timestamps.
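If it helps with collecting those logs, below is a minimal sketch of a client-side recorder that timestamps each incoming server event on receipt. EventLog, WATCHED_PREFIXES and delay_after are hypothetical names, not part of any SDK, and the timestamps are local receive times rather than server times. It only assumes that each message is a JSON object with a type field and, usually, an event_id.

```python
import json
import time
from typing import Callable, List, Optional

# Event families requested above; matched by prefix on the "type" field.
WATCHED_PREFIXES = ("input_audio_buffer.", "conversation.item.", "response.created")


class EventLog:
    """Records watched events with seconds elapsed since the log was created."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start = clock()
        self.rows: List[dict] = []

    def record(self, raw_message: str) -> None:
        """Call once per raw JSON message received from the socket."""
        received = self._clock() - self._start
        event = json.loads(raw_message)
        etype = event.get("type", "")
        if etype.startswith(WATCHED_PREFIXES):
            self.rows.append(
                {"t": received, "type": etype, "event_id": event.get("event_id")}
            )

    def delay_after(self, mark: float, event_type: str) -> Optional[float]:
        """Seconds from a client-side mark (same clock) to the next matching event."""
        for row in self.rows:
            if row["type"] == event_type and row["t"] >= mark:
                return row["t"] - mark
        return None
```

Calling record on every message and dumping rows at the end of a session gives a timeline that should show when the gap before input_audio_buffer.speech_started begins to grow.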