Realtime API with noise_reduction has sudden increase of latency

Vanessa_Holland · May 9, 2025, 3:06pm

Over the last day or two, combining noise_reduction with transcription or with a turn_detection of semantic_vad has caused a crazy amount of latency on the Realtime API. We’re talking up to a full MINUTE from the time the user stops speaking to the time the input_audio_buffer.speech_started event is returned, once the conversation has gone on for a few minutes. It’s not present early in the session, but it has happened with 100% consistency as the session goes on.

Our setup, using webRTC:

turn_detection: semantic_vad
input_audio_noise_reduction: near_field
input_audio_transcription: whisper-1
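For readers who want to reproduce the setup listed above, here is a minimal sketch of it as a `session.update` client event. The helper name `buildSessionUpdate` and its option names are made up for illustration. The field names follow the Realtime API session object as I understand it, so check them against the current API reference. Passing `null` for an option is meant to switch that feature off, which makes it easier to test the combinations one at a time.

```javascript
// Hypothetical helper: builds a `session.update` event mirroring the setup above.
// Field names are assumed from the Realtime API session object; verify before use.
function buildSessionUpdate({
  noiseReduction = "near_field",      // "near_field", "far_field", or null to disable
  transcriptionModel = "whisper-1",   // e.g. "gpt-4o-mini-transcribe", or null to disable
} = {}) {
  return {
    type: "session.update",
    session: {
      turn_detection: { type: "semantic_vad" },
      input_audio_noise_reduction: noiseReduction ? { type: noiseReduction } : null,
      input_audio_transcription: transcriptionModel ? { model: transcriptionModel } : null,
    },
  };
}
```

With WebRTC, the event would be sent over the session's data channel, for example `dataChannel.send(JSON.stringify(buildSessionUpdate()))`, where `dataChannel` stands for whatever `RTCDataChannel` your client opened.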

Removing the noise reduction solves the problem, and having ONLY noise reduction also avoids it, but adding either a turn_detection of semantic_vad or transcription with any model reintroduces it.

We’ve tried with far_field reduction as well with the same increase in latency. We’ve also tried swapping out the transcription model to gpt-4o-mini-transcribe.

It seems to be related to the combination of noise reduction (of either type) and transcription with any model
OR
noise reduction (of either type) and turn_detection being semantic_vad

Anyone else experiencing this?

sps · May 9, 2025, 8:29pm

Welcome to the community, @Vanessa_Holland.

In my understanding, both the VAD choice and noise reduction affect the time it takes for the model to respond. However, such a large delay is definitely not helpful.

I tested this with semantic_vad and near_field input_audio_noise_reduction, and was able to alleviate the slow response time by setting the eagerness property for semantic_vad turn_detection to high.

Vanessa_Holland · May 9, 2025, 8:47pm

Thanks for the reply. I should have mentioned that we do have the eagerness set to high. The beginning of sessions is normal; it’s once a session gets to around 5 minutes that there’s a spike in latency. This issue is new: we’ve had this setup for weeks without the late-in-session spike.

sps · May 9, 2025, 9:01pm

Thanks for confirming. It would help with troubleshooting if you could share event logs for the delayed responses, especially the input_audio_buffer, conversation_item and response.created events.
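One possible way to collect such logs is sketched below. It is a small in-memory recorder that timestamps each relevant server event on arrival and notes the gap since the previous one. The names `createEventLog` and `INTERESTING_PREFIXES` are made up for this example. The prefix list assumes the events appear as `input_audio_buffer.*`, `conversation.item.*` and `response.created`, so adjust it to whatever your session actually emits.

```javascript
// Assumed event-name prefixes; adjust to the events your session emits.
const INTERESTING_PREFIXES = ["input_audio_buffer.", "conversation.item.", "response.created"];

// Hypothetical recorder: stores the arrival time of each matching event.
// `now` is injectable so the recorder can be tested with a fake clock.
function createEventLog(now = () => Date.now()) {
  const entries = [];
  return {
    entries,
    record(event) {
      // Ignore events outside the families of interest.
      if (!INTERESTING_PREFIXES.some((prefix) => event.type.startsWith(prefix))) return;
      const receivedAt = now();
      const previous = entries[entries.length - 1];
      entries.push({
        type: event.type,
        eventId: event.event_id ?? null,
        receivedAt,
        // Gap to the previous recorded event; 0 for the first entry.
        msSincePrevious: previous ? receivedAt - previous.receivedAt : 0,
      });
    },
  };
}
```

In a WebRTC client this could be wired up with something like `dataChannel.addEventListener("message", (e) => log.record(JSON.parse(e.data)))`, where `dataChannel` is your own `RTCDataChannel`. `console.table(log.entries)` then gives a log that is easy to paste into a thread.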

Vanessa_Holland · May 12, 2025, 2:50pm

I’m actually able to recreate this issue with the openai/openai-realtime-console demo repo by updating the configs:

body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17",
      voice: "verse",
      turn_detection: {
        type: "semantic_vad",
        eagerness: "high"
      },
      input_audio_noise_reduction: {
        type: "near_field"
      },
    }),

Once you start a session, the delay between the time the user speaks and the input_audio_buffer.speech_started event starts to spike around 5 minutes in, which is exactly what we are seeing with our implementation. Here are logs with timestamps.

Initial session runs smoothly:

Latency spikes ~5mins into session:
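For anyone trying to quantify the spike described above, here is a rough sketch of one way to estimate how late a `speech_started` event arrives. It relies on the `audio_start_ms` field, which as far as I know the event carries as an offset into the audio sent during the session. It also assumes audio has been streamed continuously and in real time since `sessionStartMs`, so the result is an approximation, and VAD padding will shift it slightly. The names `createLagEstimator` and `sessionStartMs` are made up for this example.

```javascript
// Hypothetical estimator: compares wall-clock time elapsed since the session
// started with the audio offset at which the server says speech began.
function createLagEstimator(sessionStartMs) {
  return function estimateLag(event, receivedAtMs) {
    // Only speech_started events carry the offset this estimate relies on.
    if (event.type !== "input_audio_buffer.speech_started") return null;
    const elapsedMs = receivedAtMs - sessionStartMs;
    // Positive values mean the event arrived that many ms after the speech began.
    return elapsedMs - event.audio_start_ms;
  };
}
```

Plotting this value over the lifetime of a session should show whether the lag grows gradually or jumps around the 5 minute mark.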
