Realtime transcription message flow is wrong

I thought real-time STT was “broken,” but here’s what’s really happening for me:

  1. Model matters

    • whisper-1 is not a streaming model: it sends at most one delta, then a single completed event.

    • Use gpt-4o-mini-transcribe or gpt-4o-transcribe if you want live, rolling delta updates.

  2. Commits trigger decoding

    • The server only starts transcribing after the input-audio buffer is committed.

      • If you turn on server_vad, the commit happens automatically after a pause (≥ silence_duration_ms).

      • If you talk non-stop, no pause ⇒ no commit ⇒ no deltas until you finally go quiet.

  3. How to get true realtime while speaking continuously

    • Keep the transcribe model, but either

      • send input_audio_buffer.commit yourself every 300-500 ms, or

      • turn off VAD (turn_detection.type = "none") and decide when to commit/end_turn in the client.

    • Smaller audio chunks (0.25-0.5 s) + periodic commits give deltas within ~0.5 s, with only a tiny accuracy hit.

  4. Deltas are stable
    Each delta just adds new tokens; the completed event is only a final “done” marker, so nothing gets overwritten.

    So now I see that:
    After VAD detects a pause, the server commits the audio buffer, then streams delta messages, and finally emits a single completed event.
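
The append-only delta behavior in point 4 can be sketched as a simple accumulator. The event type names are the Realtime transcription events; the sample payload shapes below are illustrative:

```python
def accumulate_transcript(events):
    """Fold Realtime transcription events into one rolling transcript.

    Each `...transcription.delta` appends new text; `...completed` is
    only the final "done" marker and overwrites nothing.
    """
    parts = []
    for event in events:
        if event["type"] == "conversation.item.input_audio_transcription.delta":
            parts.append(event["delta"])
        elif event["type"] == "conversation.item.input_audio_transcription.completed":
            break  # final marker; the full text is already assembled
    return "".join(parts)

# Illustrative event stream:
events = [
    {"type": "conversation.item.input_audio_transcription.delta", "delta": "Hello"},
    {"type": "conversation.item.input_audio_transcription.delta", "delta": ", world"},
    {"type": "conversation.item.input_audio_transcription.completed", "transcript": "Hello, world"},
]
print(accumulate_transcript(events))  # -> Hello, world
```

Because deltas are stable, the client never needs to diff or re-render earlier text; string concatenation is enough.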

{
  "input_audio_format": "pcm16",

  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",   // or "gpt-4o-mini-transcribe"
    "language": "en",
    "prompt": "Transcribe the incoming audio in real time."
  },

  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.05,              // RMS gate (0 = silence … 1 = loud)
    "prefix_padding_ms": 300,       // keeps first syllable from being clipped
    "silence_duration_ms": 50       // VAD fires after 50 ms of quiet
  },

  "client_chunk_size": 32000        // bytes per append (~1 sec at 16-kHz PCM)
}
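
The manual-commit trick from point 3 can be sketched as a pure helper that turns raw PCM16 audio into the sequence of events to send over the WebSocket. The `input_audio_buffer.append` / `input_audio_buffer.commit` event types are from the Realtime API; the chunk size and commit cadence are illustrative choices, not API parameters:

```python
import base64

def build_append_and_commit_events(pcm_bytes, chunk_bytes=16000, commit_every=2):
    """Split raw PCM16 audio into input_audio_buffer.append events and
    interleave an input_audio_buffer.commit after every `commit_every`
    appends, so decoding starts without waiting for a VAD pause.

    chunk_bytes=16000 is ~0.5 s at 16 kHz mono PCM16 (2 bytes/sample),
    so commit_every=2 commits roughly once per second of speech.
    """
    events = []
    appends = 0
    for i in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[i:i + chunk_bytes]
        events.append({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        appends += 1
        if appends % commit_every == 0:
            events.append({"type": "input_audio_buffer.commit"})
    if appends % commit_every != 0:
        # commit any trailing audio that hasn't been flushed yet
        events.append({"type": "input_audio_buffer.commit"})
    return events
```

In a real client each dict would be JSON-serialized and sent on the socket as it is produced; building the list up front just makes the interleaving easy to see. Note this only makes sense with VAD off (or alongside it), since server_vad would otherwise commit on its own.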
