We’re streaming Twilio Voice media (8 kHz µ-law) into gpt-4o-realtime-preview via wss://api.openai.com/v1/realtime. Our bridge decodes to PCM16, upsamples to 16 kHz, and batches ~600 ms of audio before sending input_audio_buffer.append followed by input_audio_buffer.commit. Silent buffers are now either skipped or padded with a low-amplitude keepalive tone, so every commit contains >0 ms of audio.
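A minimal sketch of the conversion step described above, for anyone trying to reproduce it. It assumes 160-byte Twilio frames (20 ms at 8 kHz) and uses plain linear interpolation for the 2x upsample; the function names are hypothetical and this is not our exact bridge code.

```python
import struct


def decode_ulaw(data: bytes) -> list[int]:
    """Decode G.711 mu-law bytes into signed 16-bit linear samples."""
    samples = []
    for byte in data:
        u = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        samples.append(-magnitude if sign else magnitude)
    return samples


def upsample_2x(samples: list[int]) -> list[int]:
    """Double the sample rate (8 kHz -> 16 kHz) by linear interpolation."""
    out = []
    for i, current in enumerate(samples):
        # The last sample has no successor, so it is simply repeated.
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) // 2)
    return out


def twilio_frame_to_pcm16(frame: bytes) -> bytes:
    """Convert one mu-law frame into little-endian PCM16 at twice the rate."""
    pcm = upsample_2x(decode_ulaw(frame))
    return struct.pack(f"<{len(pcm)}h", *pcm)
```

Under these assumptions one 160-byte frame becomes 640 bytes of PCM16, so 25 frames yield 16000 bytes, which matches the chunkCount and byteLength in the log below.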
However, the realtime WebSocket still closes with code 1005 immediately after the model finishes the greeting (“Thanks for calling…”). At that moment:
- OpenAI has acknowledged the append/commit (input_audio_buffer.committed).
- The server then emits response.audio.done and response.done.
- Within ~100 ms we receive error { code: "input_audio_buffer_commit_empty", message: "Expected at least 100ms of audio, but buffer only has 0.00ms of audio." } followed by a socket close (code 1005). No outbound audio packets are transmitted (outboundPacketsSent: 0).
- Twilio closes the media stream right afterward.
Example (CallId 018dcc44-f2a4-4e93-8157-d0605593849a, event ids event_CZHrXc…):
… chunkCount: 25, byteLength: 16000, durationMs: 500, encodedBytes: 21336
input_audio_buffer.committed
response.audio_transcript.done → “Thanks for calling…”
response.done
error { code: input_audio_buffer_commit_empty, message: “…0.00ms of audio.” }
openai realtime socket closed { code: 1005, outboundPacketsSent: 0, bufferedPcmBytes: 1920, awaitingResponse: true }
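As a sanity check on the figures in the log above, here is a small sketch of the arithmetic, assuming mono PCM16 at 16 kHz as described earlier. The helper names are hypothetical.

```python
import base64

SAMPLE_RATE_HZ = 16_000   # assumption: the bridge's upsampled rate
BYTES_PER_SAMPLE = 2      # PCM16, mono


def pcm16_duration_ms(byte_length: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration in milliseconds of a mono PCM16 buffer of the given size."""
    return byte_length * 1000 / (BYTES_PER_SAMPLE * sample_rate_hz)


def base64_length(byte_length: int) -> int:
    """Length of the base64 text produced for a payload of the given size."""
    return len(base64.b64encode(bytes(byte_length)))


print(pcm16_duration_ms(16000))  # committed batch
print(base64_length(16000))      # encoded size of that batch
print(pcm16_duration_ms(1920))   # bytes still buffered locally at close
```

With these assumptions the committed batch works out to 500 ms and 21336 base64 characters, consistent with durationMs and encodedBytes. The 1920 bytes still buffered at close correspond to 60 ms, which is below the 100 ms minimum quoted in the error message.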
Could you clarify why the server still issues input_audio_buffer_commit_empty and drops the socket even though we’ve just sent >400 ms of PCM (and your service acknowledged it)? Is there an additional requirement or timing constraint we’re missing? We need to keep the realtime channel open long enough to forward the greeting back to Twilio.
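For context, this is roughly the kind of client-side guard we would expect to rule out an empty commit from our side. It is a hedged sketch, not our production code: the CommitGuard class is hypothetical, the 100 ms threshold is taken from the error message, and the 16 kHz rate is our own assumption.

```python
import base64
import json
from typing import Optional

MIN_COMMIT_MS = 100  # threshold quoted in the server's error message


class CommitGuard:
    """Tracks audio appended since the last commit sent by this client."""

    def __init__(self, sample_rate_hz: int = 16_000) -> None:
        self.sample_rate_hz = sample_rate_hz
        self.pending_bytes = 0

    def append_event(self, pcm: bytes) -> str:
        """Build an input_audio_buffer.append event and count its bytes."""
        self.pending_bytes += len(pcm)
        return json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        })

    def pending_ms(self) -> float:
        """Milliseconds of mono PCM16 appended but not yet committed."""
        return self.pending_bytes * 1000 / (2 * self.sample_rate_hz)

    def commit_event(self) -> Optional[str]:
        """Return a commit event only if enough audio is pending, else None."""
        if self.pending_ms() < MIN_COMMIT_MS:
            return None
        self.pending_bytes = 0
        return json.dumps({"type": "input_audio_buffer.commit"})


guard = CommitGuard()
guard.append_event(bytes(1920))   # 60 ms: too short to commit
print(guard.commit_event())       # None
guard.append_event(bytes(16000))  # now 560 ms pending
print(guard.commit_event())       # commit event JSON
```

The guard only sees what the bridge itself has sent. If anything clears the server-side buffer independently, a commit that looks valid locally could still arrive at an empty buffer, which is part of what we are asking about.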