[Realtime API WebRTC] First word consistently lost when voice feedback enabled (98% reproduction rate)

Environment:

  • API: Realtime API via WebRTC
  • Configuration: create_response: false (manual control)
  • Model: gpt-realtime-2025-08-28
  • input_audio_transcription: { model: 'gpt-4o-mini-transcribe' }
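
For reference, this configuration roughly corresponds to a session.update event like the one below, sent over the WebRTC data channel (field placement follows my understanding of the Realtime API; the stub data channel is only for illustration):

```javascript
// Stub data channel for illustration; in the real app this is the
// RTCDataChannel from the WebRTC peer connection.
const sent = [];
const dataChannel = { send: (msg) => sent.push(msg) };

// Session configuration as described in the environment section above.
const sessionUpdate = {
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      prefix_padding_ms: 300,  // audio to retain before VAD detects speech
      create_response: false,  // manual control: we call response.create ourselves
    },
    input_audio_transcription: { model: "gpt-4o-mini-transcribe" },
  },
};

dataChannel.send(JSON.stringify(sessionUpdate));
```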

Issue Description:
When voice feedback is enabled (calling response.create after a tool result), 98% of voice commands lose their first word. When voice feedback is disabled, 98% of commands correctly preserve the first word.

Reproduction Steps:

  1. Configure session with turn_detection.type: "server_vad"
  2. Send a tool_result via conversation.item.create
  3. Call response.create to trigger voice feedback
  4. Immediately speak a voice command
  5. Result: First word is consistently cut off
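Concretely, steps 2 and 3 look roughly like this on the data channel (event names are from the Realtime API; the call_id and output payload are made-up placeholders):

```javascript
// Stub data channel for illustration; in the real app this is the
// RTCDataChannel from the WebRTC peer connection.
const sent = [];
const dataChannel = { send: (msg) => sent.push(msg) };

// Step 2: deliver the tool result to the conversation.
dataChannel.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: "call_123",  // hypothetical call id from the preceding function_call
    output: JSON.stringify({ ok: true }),  // hypothetical tool result
  },
}));

// Step 3: trigger voice feedback for the tool result.
dataChannel.send(JSON.stringify({ type: "response.create" }));

// Step 4: the user speaks immediately — and the first word is cut off.
```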

Expected Behavior:
The prefix_padding_ms parameter should preserve audio before VAD detection, regardless of whether the agent is generating output.

Actual Behavior:
When response.create is active, the VAD appears to reset or ignore the prefix buffer, causing the first word to be lost.

Hypothesis:
The interrupt_response mode may be aggressively clearing the input buffer when transitioning from agent output to user input, ignoring the configured prefix_padding_ms.

Workaround Tested:
Increasing prefix_padding_ms from 300ms to 700-1000ms has minimal effect.
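For completeness, the workaround was applied as a turn_detection update along these lines (values as stated above; the stub data channel is only for illustration):

```javascript
// Stub data channel for illustration; in the real app this is the
// RTCDataChannel from the WebRTC peer connection.
const sent = [];
const dataChannel = { send: (msg) => sent.push(msg) };

// Raise the pre-speech padding well above the default; in practice
// this barely changed the behavior.
dataChannel.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: { type: "server_vad", prefix_padding_ms: 1000 },
  },
}));
```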

Session IDs:

Sample Case:
When confirmation is required and the user says “Yes” or “No”, that word is lost, causing the overall process to fail.

When the user says “Add a Task for John Doe on 15, January”, the agent gets only “Task for John Doe on 15, January”. The missing first word makes the sentence nonsensical.

Impact:
Critical for production voice applications. This makes voice feedback unreliable.

Can you please fix this ASAP?

UPDATE [04/01/2025]:

Problem SOLVED! After extensive debugging, I discovered the issue was in my own code, not OpenAI’s API.

Root cause: I was calling input_audio_buffer.clear inside the input_audio_buffer.speech_started handler to force-interrupt the agent. This cleared the buffer WHILE the user was speaking, removing the first word that had just been detected.
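In code, the buggy pattern looked roughly like this (sketched with a generic onServerEvent dispatcher and a stub data channel; the real app wires this to the data channel's message event):

```javascript
// Stub data channel for illustration; in the real app this is the
// RTCDataChannel from the WebRTC peer connection.
const sent = [];
const dataChannel = { send: (msg) => sent.push(msg) };

function onServerEvent(event) {
  if (event.type === "input_audio_buffer.speech_started") {
    // Intended: barge-in — stop the agent as soon as the user talks.
    dataChannel.send(JSON.stringify({ type: "response.cancel" }));

    // BUG: this clears the input buffer WHILE the user is speaking,
    // discarding the audio (the first word) that VAD just detected.
    dataChannel.send(JSON.stringify({ type: "input_audio_buffer.clear" }));
  }
}
```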

Solution: Remove input_audio_buffer.clear from the speech_started handler. Using response.cancel alone is sufficient to interrupt the agent without losing user input.
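The fixed handler is just the cancel, with the buffer left untouched (same sketch conventions as above):

```javascript
// Stub data channel for illustration; in the real app this is the
// RTCDataChannel from the WebRTC peer connection.
const sent = [];
const dataChannel = { send: (msg) => sent.push(msg) };

function onServerEvent(event) {
  if (event.type === "input_audio_buffer.speech_started") {
    // Barge-in: cancel the in-flight response only. The input buffer is
    // not cleared, so the prefix_padding_ms audio — including the user's
    // first word — is preserved.
    dataChannel.send(JSON.stringify({ type: "response.cancel" }));
  }
}
```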

Lesson learned: The input_audio_buffer.clear command should NEVER be called during active speech detection, as it clears audio that’s currently being captured.

I apologize for the noise and hope this helps others who might make the same mistake!
