gpt-realtime-whisper rejects turn_detection even though the docs use it in the canonical example

The Realtime transcription guide shows this exact session.update as the canonical example for configuring a transcription session:

{
  "type": "session.update",
  "session": {
    "type": "transcription",
    "audio": {
      "input": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "transcription": { "model": "gpt-realtime-whisper", "language": "en" },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.5,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500
        }
      }
    }
  }
}

But when I send that payload over WebSocket to wss://api.openai.com/v1/realtime?intent=transcription, the server replies:

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Turn detection is not supported for this transcription model.",
    "param": "session.audio.input.turn_detection"
  }
}

If I omit turn_detection entirely, session.updated echoes back "turn_detection": null, confirming that server VAD is off for the session. The model then streams conversation.item.input_audio_transcription.delta events but never emits conversation.item.input_audio_transcription.completed, because nothing is endpointing turns.

For comparison, the same session shape with "model": "gpt-4o-transcribe" is accepted: session.updated echoes the server_vad settings back correctly, and .completed fires after each utterance.
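For anyone who wants to reproduce the three cases (rejected, no VAD, accepted), here is a minimal sketch in Python that builds the payloads from one helper. The helper name build_session_update is hypothetical and not part of any SDK; it only constructs the JSON shown above, and sending it over the WebSocket is left out.

```python
import json

# Server VAD settings copied from the canonical example in the guide.
SERVER_VAD = {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500,
}


def build_session_update(model, turn_detection=None):
    """Build a transcription session.update event (hypothetical helper).

    turn_detection is left out of the payload entirely when it is None.
    """
    audio_input = {
        "format": {"type": "audio/pcm", "rate": 24000},
        "transcription": {"model": model, "language": "en"},
    }
    if turn_detection is not None:
        audio_input["turn_detection"] = turn_detection
    return {
        "type": "session.update",
        "session": {"type": "transcription", "audio": {"input": audio_input}},
    }


# The three cases described in this post.
rejected = build_session_update("gpt-realtime-whisper", SERVER_VAD)
no_vad = build_session_update("gpt-realtime-whisper")
accepted = build_session_update("gpt-4o-transcribe", SERVER_VAD)

if __name__ == "__main__":
    for name, payload in [("rejected", rejected), ("no_vad", no_vad), ("accepted", accepted)]:
        print(name, json.dumps(payload))
```

Each payload would be sent as a single text frame on the same connection used above.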

I also tried the legacy POST /v1/realtime/transcription_sessions REST endpoint with gpt-realtime-whisper. It returns:

{
  "error": {
    "message": "Model \"gpt-realtime-whisper\" is only available on the GA API.",
    "type": "invalid_request_error",
    "param": "input_audio_transcription.model",
    "code": "invalid_model"
  }
}

So that path is closed too.

The dedicated gpt-realtime-whisper model page lists the model’s “Not supported” features (function calling, structured outputs, fine-tuning, predicted outputs, image, video) but does not mention turn_detection. The changelog has nothing about VAD being unavailable for this model.

Questions:

  • Is this a server-side bug, or are the docs incorrect about server_vad being supported on gpt-realtime-whisper?
  • If turn_detection is truly unsupported for this model, is the intended pattern to drive endpointing externally and send input_audio_buffer.commit manually? If so, this seems worth calling out prominently in the model page and the transcription guide.
  • Is there a different session shape or endpoint that gets server VAD working with this model?
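To make the second question concrete, this is roughly what external endpointing might look like: a minimal sketch in Python, assuming a simple energy threshold stands in for real VAD. The function names, the 20 ms frame size, and the RMS threshold are my assumptions. It also assumes that gpt-realtime-whisper accepts a manual input_audio_buffer.commit and finalizes the turn afterwards, which is exactly what I have not been able to confirm.

```python
import base64
import math
import struct

FRAME_MS = 20          # assumed frame duration fed to the endpointer
RMS_THRESHOLD = 500.0  # assumed energy threshold for 16-bit PCM; needs tuning
SILENCE_MS = 500       # mirrors silence_duration_ms from the rejected config


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def events_for_frames(frames):
    """Yield client events: an append per frame, plus a commit once
    SILENCE_MS of quiet audio follows at least one loud frame."""
    heard_speech = False
    silent_ms = 0
    for frame in frames:
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frame).decode("ascii"),
        }
        if frame_rms(frame) >= RMS_THRESHOLD:
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= SILENCE_MS:
                # Assumption: a manual commit ends the turn for this model.
                yield {"type": "input_audio_buffer.commit"}
                heard_speech = False
                silent_ms = 0
```

Each yielded dict would be serialized with json.dumps and sent on the existing WebSocket. An energy threshold is crude compared with server_vad, so this is only meant to illustrate the shape of the workaround, not to recommend it.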

I’d appreciate any clarification — happy to provide additional repro details (full payloads, headers, timestamps) if helpful.
