I thought real-time STT was “broken,” but here’s what’s really happening for me:
- **Model matters**
  - `whisper-1` is not a streaming model. It sends at most one `delta`, then `completed`.
  - Use `gpt-4o-mini-transcribe` or `gpt-4o-transcribe` if you want live, rolling `delta` updates.
- **Commits trigger decoding**
  - The server only starts transcribing after the input-audio buffer is committed.
  - If you turn on `server_vad`, the commit happens automatically after a pause (≥ `silence_duration_ms`).
  - If you talk non-stop: no pause ⇒ no commit ⇒ no deltas until you finally go quiet.
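To make the "no pause ⇒ no commit" behavior concrete, here is a toy, pure-Python model of the `server_vad` commit rule (the real server-side implementation is not public; the frame size and loudness scale here are illustrative assumptions):

```python
def vad_commits(frame_levels, frame_ms=10, threshold=0.05, silence_duration_ms=50):
    """Return indices of frames where a commit would fire.

    frame_levels: per-frame loudness in [0, 1]; a frame below `threshold`
    counts as quiet. A commit fires only after `silence_duration_ms` of
    consecutive quiet following some buffered speech.
    """
    commits = []
    quiet_ms = 0
    buffered = False  # uncommitted speech sitting in the buffer?
    for i, level in enumerate(frame_levels):
        if level >= threshold:
            buffered = True
            quiet_ms = 0          # any speech resets the silence timer
        else:
            quiet_ms += frame_ms
            if buffered and quiet_ms >= silence_duration_ms:
                commits.append(i)  # pause long enough -> commit
                buffered = False
                quiet_ms = 0
    return commits
```

Feeding it 100 frames of continuous speech yields no commits at all, while 20 loud frames followed by quiet commits as soon as the silence window fills.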
- **How to get true real-time while speaking continuously**
  - Keep the transcribe model, but either:
    - send `input_audio_buffer.commit` yourself every 300-500 ms, or
    - turn off VAD (`turn_detection.type = "none"`) and decide when to commit / end the turn in the client.
  - Smaller audio chunks (0.25-0.5 s) plus periodic commits give deltas within ~0.5 s, with only a tiny accuracy hit.
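A sketch of the periodic-commit pacing, assuming the Realtime API event names `input_audio_buffer.append` and `input_audio_buffer.commit`. Instead of opening a real websocket, this builds the outgoing message list so the pacing logic is easy to inspect; the 250 ms chunk size and 400 ms commit interval are just the values from the range above:

```python
import base64

def build_messages(pcm_chunks, chunk_ms=250, commit_every_ms=400):
    """Interleave append events with a commit every ~commit_every_ms of audio."""
    msgs, since_commit = [], 0
    for chunk in pcm_chunks:
        msgs.append({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode(),  # API expects base64 PCM
        })
        since_commit += chunk_ms
        if since_commit >= commit_every_ms:
            msgs.append({"type": "input_audio_buffer.commit"})
            since_commit = 0
    return msgs
```

In a live client you would `await ws.send(json.dumps(msg))` for each entry as the audio arrives; the point is only that commits ride along on a timer rather than waiting for VAD.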
- **Deltas are stable**
  - Each `delta` just adds new tokens; the `completed` event is only a final "done" marker. Nothing gets overwritten.

So now I see that: after VAD detects a pause, the server sends a `started` event, then streams `delta` messages, and finally emits a single `completed` event.
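Because deltas are purely additive, the client-side reducer is trivial. A minimal sketch (event types shortened for illustration; the real API uses longer names like `conversation.item.input_audio_transcription.delta`):

```python
def fold_transcript(events):
    """Accumulate delta events into a transcript; `completed` only finalizes."""
    text, done = "", False
    for ev in events:
        if ev["type"].endswith(".delta"):
            text += ev["delta"]   # deltas only append, never rewrite
        elif ev["type"].endswith(".completed"):
            done = True           # marker only; text is already final
    return text, done
```

No diffing or replacement logic is needed, which is exactly why streaming UIs can render each delta immediately.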
```jsonc
{
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",      // or "gpt-4o-mini-transcribe"
    "language": "en",
    "prompt": "Transcribe the incoming audio in real time."
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.05,                 // RMS gate (0 = silence … 1 = loud)
    "prefix_padding_ms": 300,          // keeps first syllable from being clipped
    "silence_duration_ms": 50          // VAD fires after 50 ms of quiet
  },
  "client_chunk_size": 32000           // bytes per append (~1 sec at 16-kHz PCM)
}
```
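For completeness, the config above gets applied by wrapping it in a `session.update` event and sending that over the websocket. A minimal sketch (check the Realtime API docs for the exact event type your endpoint expects; transcription-only endpoints may use a different one):

```python
import json

def session_update(config: dict) -> str:
    """Serialize a session config into a session.update event payload."""
    return json.dumps({"type": "session.update", "session": config})
```

You would send this once right after the websocket opens, before appending any audio.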