Realtime GA text transcription

Hi OpenAI Team,

I’m writing to request the restoration of the input_audio_transcription feature that was available in the Realtime API beta but removed in the GA release.

What We Had in Beta:

    // Session configuration (Beta)
    {
      type: "session.update",
      session: {
        modalities: ["text", "audio"],
        turn_detection: { type: "server_vad" },
        input_audio_transcription: { model: "whisper-1" },  // ✅ This existed
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw"
      }
    }

This would generate conversation.item.input_audio_transcription.completed events containing real-time user speech transcripts within the same session.

What Changed in GA:

The GA release removed input_audio_transcription entirely. Multiple developers in the community have reported this issue:

  • “Input_audio_transcription in realtime-api”
  • “Unable to Access User Audio Transcript in Realtime API”
  • “Problems using session.update with the realtime-api”

Questions:

  • Is the removal of input_audio_transcription permanent or temporary?
  • What is the recommended approach for real-time user transcription in GA?
  • Are there alternative methods we should be using?

The community would greatly appreciate clarity on this feature’s status.

Thank you for considering this request.

Best regards,
Rohan

Rohan, the messages are not gone; you just have to request them by providing a transcription model. This is the session.update I send:

        session_update_message = {
            "type": "session.update",
            "session": {
                "type": "realtime",
                "model": "gpt-realtime",
                "audio": {
                    "input": {
                        "format": {
                            "type": "audio/pcm",
                            "rate": NATIVE_OAI_SAMPLE_RATE_HZ  # input sample rate, defined elsewhere in my code
                        },
                        "noise_reduction": {"type": "far_field"},
                        # this is the part that brings the user transcripts back
                        "transcription": {
                            "model": "gpt-4o-mini-transcribe"
                        },
                        "turn_detection": {
                            "create_response": True,
                            "interrupt_response": False,
                            "prefix_padding_ms": 300,
                            "silence_duration_ms": 700,
                            "threshold": 0.5,
                            "type": "server_vad"
                        }
                    },
                    "output": {
                        "format": {
                            "type": "audio/pcm",
                            "rate": 24000
                        },
                        "speed": 1,
                        "voice": "marin"
                    }
                },
                "instructions": sp,  # system prompt string, defined elsewhere
                "max_output_tokens": 4096,
                "output_modalities": ["audio"],
                "tool_choice": "auto",
                "tools": [],
                "tracing": None,
                "truncation": "auto"
            }
        }
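
I send that right after the socket opens. A minimal sketch, assuming ws is an already-open websocket client object with a plain send() method (with an async client it would be await ws.send(...)):

    import json

    # Serialize the session.update dict and push it over the open Realtime
    # websocket; ws and session_update_message come from the code above.
    ws.send(json.dumps(session_update_message))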

then I catch these events:

    if event_type == 'response.output_audio_transcript.done':
        # ai transcript

    elif event_type == 'conversation.item.input_audio_transcription.completed':
        # human transcript
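
In case it's useful, here's roughly what that dispatch looks like as a self-contained function. I'm assuming the text is in the event's transcript field for both event types; check your own payloads if it isn't, and swap the prints for your own handlers:

    import json

    def handle_realtime_event(message: str) -> None:
        """Route one raw Realtime API websocket message to a transcript handler."""
        event = json.loads(message)
        event_type = event.get("type")

        if event_type == "response.output_audio_transcript.done":
            # transcript of the audio the model just spoke
            print("AI:", event.get("transcript"))

        elif event_type == "conversation.item.input_audio_transcription.completed":
            # transcript of the user's speech, produced by the transcription
            # model requested in session.audio.input.transcription above
            print("USER:", event.get("transcript"))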


hope this helps!!

Yes! That worked! Thank you so much.