Realtime API with noise_reduction shows a sudden increase in latency

Over the last day or two, combining noise_reduction with either transcription or a turn_detection of semantic_vad has caused severe latency on the Realtime API. Once the conversation has gone on for a few minutes, it takes up to a full MINUTE from the time the user stops speaking to the time the input_audio_buffer.speech_started event is returned. The latency isn’t present early in the session, but it has appeared with 100% consistency as the session goes on.

Our setup, using webRTC:

turn_detection: semantic_vad
input_audio_noise_reduction: near_field
input_audio_transcription: whisper-1
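For anyone trying to reproduce this, here is a rough sketch of how the three settings above could be expressed as a session.update client event. The nesting of the fields follows my understanding of the Realtime API session object and may differ between API versions, so treat it as an assumption. The helper name build_session_update is made up for illustration. With WebRTC the JSON would be sent over the data channel.

```python
import json


def build_session_update(noise_reduction="near_field", transcription_model="whisper-1"):
    """Build a session.update event for the setup described above.

    Hypothetical helper: the field layout is an assumption and should be
    checked against the API reference for the version in use.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {"type": "semantic_vad"},
            "input_audio_noise_reduction": {"type": noise_reduction},
            "input_audio_transcription": {"model": transcription_model},
        },
    }


payload = build_session_update()
# Serialize to JSON; this string is what would go over the data channel.
print(json.dumps(payload, indent=2))
```

Swapping the arguments (for example far_field or gpt-4o-mini-transcribe) gives the other variants mentioned in this post.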

Removing the noise reduction solves the problem, and having ONLY noise reduction also solves the problem, but adding either a turn_detection of semantic_vad or transcription with any model reintroduces it.

We’ve tried far_field noise reduction as well, with the same increase in latency. We’ve also tried swapping the transcription model to gpt-4o-mini-transcribe.

It seems to be related to the combination of noise reduction (of either type) and transcription with any model
OR
noise reduction (of either type) and turn_detection being semantic_vad

Anyone else experiencing this?

Welcome to the community, @Vanessa_Holland.

In my understanding, both the VAD choice and noise reduction affect the time it takes for the model to respond. However, such a large delay is definitely not helpful.

I tested this with semantic_vad and near_field input_audio_noise_reduction, and was able to alleviate the slow response time by setting the eagerness property for semantic_vad turn_detection to high.
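In case it helps, this is roughly what that change looks like as session fields. It is a minimal sketch: with_eagerness is a hypothetical helper, and the placement of eagerness inside turn_detection is my understanding of the session object rather than something confirmed in this thread.

```python
def with_eagerness(session, eagerness="high"):
    """Return a copy of the session fields with semantic_vad eagerness set.

    Hypothetical helper; the input dict is not modified.
    """
    updated = dict(session)
    updated["turn_detection"] = {"type": "semantic_vad", "eagerness": eagerness}
    return updated


session_fields = with_eagerness({"input_audio_noise_reduction": {"type": "near_field"}})
# Wrap the fields in a session.update event before sending.
event = {"type": "session.update", "session": session_fields}
```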

Thanks for the reply. I should have mentioned that we already have the eagerness set to high. The beginning of a session is normal; it’s once a session reaches around 5 minutes that the latency spikes. This issue is new: we’ve had this setup for weeks without the late-session spike.

Thanks for confirming. It would help in tracking this down if you could share event logs for the delayed responses, especially the input_audio_buffer, conversation_item and response.created events.
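If you are logging events on the client, something like the sketch below could turn the log into per-turn delays, which makes it easy to see where in the session the spike begins. It assumes each event is recorded with a client-side receive time in seconds. LoggedEvent and gaps_between are made-up names, and the timestamps in the example are synthetic, not real measurements.

```python
from dataclasses import dataclass


@dataclass
class LoggedEvent:
    received_at: float  # client-side receive time in seconds (assumption)
    type: str           # server event type, e.g. "response.created"


def gaps_between(events, start_type, end_type):
    """For each start_type event, return the seconds until the next end_type event.

    If several start_type events arrive before an end_type event,
    the most recent one is used.
    """
    gaps = []
    pending = None
    for ev in sorted(events, key=lambda e: e.received_at):
        if ev.type == start_type:
            pending = ev.received_at
        elif ev.type == end_type and pending is not None:
            gaps.append(ev.received_at - pending)
            pending = None
    return gaps


# Synthetic example data for illustration only.
log = [
    LoggedEvent(10.0, "input_audio_buffer.speech_started"),
    LoggedEvent(10.5, "response.created"),
    LoggedEvent(300.0, "input_audio_buffer.speech_started"),
    LoggedEvent(342.0, "response.created"),
]
print(gaps_between(log, "input_audio_buffer.speech_started", "response.created"))
```

The same function works for any pair of event types, so you can compare several stages of the turn and see which one grows over time.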