The Realtime transcription guide shows this exact session.update as the canonical example for configuring a transcription session:
{
  "type": "session.update",
  "session": {
    "type": "transcription",
    "audio": {
      "input": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "transcription": { "model": "gpt-realtime-whisper", "language": "en" },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.5,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500
        }
      }
    }
  }
}
But when I send that payload over WebSocket to wss://api.openai.com/v1/realtime?intent=transcription, the server replies:
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Turn detection is not supported for this transcription model.",
    "param": "session.audio.input.turn_detection"
  }
}
If I omit turn_detection entirely, session.updated echoes back "turn_detection": null, confirming that server VAD is off for the session. The model then streams conversation.item.input_audio_transcription.delta events but never emits the matching .completed event, because nothing is endpointing turns.
For comparison, the same payload with "model": "gpt-4o-transcribe" is accepted, session.updated echoes the server_vad settings back correctly, and .completed fires after each utterance.
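For anyone who wants to reproduce the three cases, here is a minimal sketch of how I build the payloads. The helper name build_session_update and the variable names are my own; the field names and values are exactly the ones in the payload above.

```python
import json
from typing import Optional

# Server VAD settings copied from the guide's example payload.
SERVER_VAD = {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500,
}


def build_session_update(model: str, turn_detection: Optional[dict]) -> str:
    """Build the session.update event; leave turn_detection out when it is None."""
    audio_input = {
        "format": {"type": "audio/pcm", "rate": 24000},
        "transcription": {"model": model, "language": "en"},
    }
    if turn_detection is not None:
        audio_input["turn_detection"] = turn_detection
    return json.dumps({
        "type": "session.update",
        "session": {"type": "transcription", "audio": {"input": audio_input}},
    })


# Case 1: rejected with "Turn detection is not supported for this transcription model."
rejected = build_session_update("gpt-realtime-whisper", SERVER_VAD)
# Case 2: accepted, but session.updated echoes "turn_detection": null
no_vad = build_session_update("gpt-realtime-whisper", None)
# Case 3: accepted, server_vad echoed back
accepted = build_session_update("gpt-4o-transcribe", SERVER_VAD)
```

Each string is sent as a single WebSocket text frame right after the connection opens.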
I also tried the legacy POST /v1/realtime/transcription_sessions REST endpoint with gpt-realtime-whisper. It returns:
{
  "error": {
    "message": "Model \"gpt-realtime-whisper\" is only available on the GA API.",
    "type": "invalid_request_error",
    "param": "input_audio_transcription.model",
    "code": "invalid_model"
  }
}
So that path is closed too.
The dedicated gpt-realtime-whisper model page lists the model’s “Not supported” features (function calling, structured outputs, fine-tuning, predicted outputs, image, video) but does not mention turn_detection. The changelog has nothing about VAD being unavailable for this model.
Questions:
- Is this a server-side bug, or are the docs incorrect about server_vad being supported on gpt-realtime-whisper?
- If turn_detection is truly unsupported for this model, is the intended pattern to drive endpointing externally and send input_audio_buffer.commit manually? If so, this seems worth calling out prominently in the model page and the transcription guide.
- Is there a different session shape or endpoint that gets server VAD working with this model?
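To make the second question concrete, this is roughly the client-side fallback I have in mind: a crude energy-based endpointer that appends every frame and sends input_audio_buffer.commit after about 500 ms of trailing silence. It is only a sketch. The frame size, the RMS threshold, and the function names (frame_rms, endpoint_events) are my own placeholders, and I have not confirmed that a manual commit makes .completed fire on gpt-realtime-whisper.

```python
import base64
import json
import struct
from typing import Iterable, Iterator

FRAME_MS = 20            # assumed frame size; 480 samples per frame at 24 kHz
SILENCE_MS = 500         # mirrors silence_duration_ms from the server_vad example
RMS_THRESHOLD = 500.0    # assumed energy threshold for 16-bit samples; needs tuning


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return (sum(s * s for s in samples) / count) ** 0.5


def endpoint_events(frames: Iterable[bytes]) -> Iterator[str]:
    """Yield JSON client events: one append per frame, one commit per utterance."""
    silence_frames_needed = SILENCE_MS // FRAME_MS
    silent_run = 0
    speech_pending = False  # True once speech was appended since the last commit
    for frame in frames:
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frame).decode("ascii"),
        })
        if frame_rms(frame) >= RMS_THRESHOLD:
            speech_pending = True
            silent_run = 0
        elif speech_pending:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                # Enough trailing silence: close the turn manually.
                yield json.dumps({"type": "input_audio_buffer.commit"})
                speech_pending = False
                silent_run = 0
```

Each yielded string would be sent as one WebSocket text frame. A real implementation would likely want a proper VAD rather than a fixed RMS threshold, which is part of why I would prefer server VAD if it can be made to work.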
I’d appreciate any clarification, and I’m happy to provide additional repro details (full payloads, headers, timestamps) if helpful.