I managed to get the Realtime API working as a conversation pretty easily, but I cannot get the `conversation.item.input_audio_transcription.completed` event to fire. I don't get any events back while the user is talking other than `input_audio_buffer.speech_started`. Is this because the audio from `getUserMedia` is Opus? Do I need to convert it to PCM16 for the Whisper transcription to work? No errors are surfaced anywhere, so it's difficult to ascertain.
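If re-encoding is needed, I assume it would look something like this untested sketch: take Float32 samples (which I'd presumably grab from an AudioWorklet at 24 kHz), scale them to signed 16-bit little-endian PCM, then base64-encode and send an `input_audio_buffer.append` event. The `ws` here is a hypothetical WebSocket connection to the Realtime API, not code I currently have:

```js
// Untested sketch: convert Float32 samples to PCM16 (little-endian).
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buffer;
}

// Base64-encode and append to the input audio buffer.
// `ws` is a hypothetical WebSocket to the Realtime API.
function appendAudio(ws, float32Samples) {
  const pcm = floatTo16BitPCM(float32Samples);
  // Fine for small chunks; a chunked encoder would be needed for large buffers.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(pcm)));
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64 }));
}
```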
This is my session-creation code, which runs on the Node server that my React client uses to instantiate the session:
```js
await fetch('https://api.openai.com/v1/realtime/sessions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o-realtime-preview-2024-12-17',
    instructions: 'You are a friendly chap called bob',
    modalities: ['audio', 'text'],
    input_audio_format: 'pcm16', // the possible options here imply it should be re-encoded
    input_audio_transcription: {
      model: 'whisper-1',
    },
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 1000,
    },
    temperature: 0.8,
    voice: 'verse',
  }),
});
```
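For completeness, this is roughly how I'm listening for server events on the client; `dc` is the WebRTC data channel from my setup, and the event names and `transcript` field are taken from the Realtime API reference:

```js
// Listen for server events on the WebRTC data channel.
dc.addEventListener('message', (e) => {
  const event = JSON.parse(e.data);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    // This is the event that never fires for me.
    console.log('User said:', event.transcript);
  } else if (event.type === 'input_audio_buffer.speech_started') {
    // This is the only event I currently receive while the user talks.
    console.log('Speech started');
  }
});
```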