Environment:
- API: Realtime API via WebRTC
- Configuration: create_response: false (manual control)
- Model: gpt-realtime-2025-08-28
- input_audio_transcription: { model: 'gpt-4o-mini-transcribe' }
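For reference, the settings in this report could be expressed as a `session.update` event roughly like the sketch below. The helper name `buildSessionUpdate` is made up for illustration, and the flat field layout follows the names used in this post; newer versions of the Realtime interface may nest these fields differently (for example under `audio.input`), so treat the shape as an assumption. The model is not set here because it is chosen when the session is created.

```typescript
// Hypothetical shape for the server VAD settings mentioned in this report.
interface TurnDetection {
  type: "server_vad";
  create_response: boolean;
  prefix_padding_ms: number;
}

interface SessionUpdateEvent {
  type: "session.update";
  session: {
    turn_detection: TurnDetection;
    input_audio_transcription: { model: string };
  };
}

// Hypothetical helper: builds the session.update payload for a given padding.
function buildSessionUpdate(prefixPaddingMs: number): SessionUpdateEvent {
  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        create_response: false, // manual control: the client calls response.create itself
        prefix_padding_ms: prefixPaddingMs,
      },
      input_audio_transcription: { model: "gpt-4o-mini-transcribe" },
    },
  };
}

// 300 ms is the starting value mentioned under "Workaround Tested".
const sessionUpdate = buildSessionUpdate(300);
console.log(JSON.stringify(sessionUpdate, null, 2));
```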
Issue Description:
When voice feedback is enabled (calling response.create after a tool result), 98% of voice commands lose their first word. When voice feedback is disabled, 98% of commands correctly preserve the first word.
Reproduction Steps:
- Configure session with turn_detection.type: "server_vad"
- Send a tool_result via conversation.item.create
- Call response.create to trigger voice feedback
- Immediately speak a voice command
- Result: First word is consistently cut off
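The two client events in the steps above can be sketched as follows. This is a minimal illustration, not the exact application code: `EventChannel`, `buildToolResultEvents`, `sendToolResultWithFeedback`, and the call ID `call_123` are placeholders, and the tool result is assumed to be sent as a `function_call_output` item.

```typescript
// Minimal shape of the channel used here; an RTCDataChannel satisfies it.
interface EventChannel {
  send(data: string): void;
}

// Hypothetical helper: returns the tool result event followed by response.create.
function buildToolResultEvents(callId: string, output: string) {
  return [
    {
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id: callId, output },
    },
    { type: "response.create" },
  ];
}

// Sends both events in order over the given channel.
function sendToolResultWithFeedback(channel: EventChannel, callId: string, output: string): void {
  for (const event of buildToolResultEvents(callId, output)) {
    channel.send(JSON.stringify(event));
  }
  // In the reproduction, the user starts speaking right after this point.
}

// Usage with a stand-in channel that records what would be sent.
const sent: string[] = [];
const recorder: EventChannel = {
  send: (data: string) => {
    sent.push(data);
  },
};
sendToolResultWithFeedback(recorder, "call_123", JSON.stringify({ ok: true }));
console.log(sent);
```

In the browser, the same function would be called with the data channel of the WebRTC peer connection instead of the recorder.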
Expected Behavior:
The prefix_padding_ms parameter should preserve audio before VAD detection, regardless of whether the agent is generating output.
Actual Behavior:
When response.create is active, the VAD appears to reset or ignore the prefix buffer, causing the first word to be lost.
Hypothesis:
The interrupt_response mode may be aggressively clearing the input buffer when transitioning from agent output to user input, ignoring the configured prefix_padding_ms.
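One way this hypothesis could be checked is to run the same reproduction twice, changing only `interrupt_response` in the server VAD settings. The sketch below only builds the two payloads; `buildInterruptProbe` is a made-up helper name, the flat field layout is the same assumption as elsewhere in this post, and whether the flag changes the outcome is unknown.

```typescript
// Hypothetical helper: server_vad settings with interrupt_response toggled.
function buildInterruptProbe(interruptResponse: boolean, prefixPaddingMs: number) {
  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        create_response: false,
        interrupt_response: interruptResponse,
        prefix_padding_ms: prefixPaddingMs,
      },
    },
  };
}

// Two variants of the same session, differing only in interrupt_response.
const probes = [true, false].map((flag) => buildInterruptProbe(flag, 300));
for (const probe of probes) {
  console.log(JSON.stringify(probe));
}
```

If the first word survives with `interrupt_response: false` but not with `true`, that would point toward the interruption path rather than the VAD padding itself.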
Workaround Tested:
Increasing prefix_padding_ms from 300ms to 700-1000ms has minimal effect.
Session IDs:
Sample Case:
When confirmation is required and the user says “Yes” or “No”, that word is lost, causing the whole flow to fail.
When the user says “Add a Task for John Doe on 15, January”, the agent receives only “Task for John Doe on 15, January”. The missing first word makes the whole sentence read oddly.
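To count such cases across many test utterances, a small comparison helper along these lines could be used. It is only a sketch: `toWords` and `leadingWordsLost` are illustrative names, and the normalization assumes plain English text with ASCII letters and digits.

```typescript
// Lowercase the text and split it into words, ignoring punctuation.
function toWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .split(" ")
    .filter((w) => w.length > 0);
}

// Returns how many leading words of `spoken` are missing from `received`,
// or -1 if `received` is not a trailing part of `spoken`.
function leadingWordsLost(spoken: string, received: string): number {
  const spokenWords = toWords(spoken);
  const receivedText = toWords(received).join(" ");
  for (let k = 0; k <= spokenWords.length; k++) {
    if (spokenWords.slice(k).join(" ") === receivedText) {
      return k;
    }
  }
  return -1;
}

console.log(leadingWordsLost("Yes", "")); // 1
console.log(
  leadingWordsLost("Add a Task for John Doe on 15, January", "Task for John Doe on 15, January"),
); // 2
```

For the second sample it reports 2, because both “Add” and “a” are absent from the received text.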
Impact:
Critical for production voice applications. This makes voice feedback unreliable.
Could you please fix this as soon as possible?