Realtime API, getUserMedia, and WebRTC - does mic audio need to be converted to PCM16 for Whisper transcription to work?

I managed to get the Realtime API working as a conversation fairly easily, but I cannot get the conversation.item.input_audio_transcription.completed event to fire. While the user is talking, the only event I receive is input_audio_buffer.speech_started. Is this because the audio from getUserMedia is Opus-encoded over WebRTC? Do I need to convert it to PCM16 for the Whisper transcription to work? There are no visible errors, so it's hard to diagnose.
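In case the client side is relevant, this is roughly what my connection code looks like (simplified sketch inside an async init function; EPHEMERAL_KEY is the client_secret returned by the session endpoint below, and error handling is left out):

```js
// Client side (React) — simplified sketch of my WebRTC setup.
const pc = new RTCPeerConnection();

// The mic track goes straight onto the peer connection; WebRTC encodes it as Opus.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(stream.getAudioTracks()[0], stream);

// Play the model's audio replies.
const audioEl = new Audio();
audioEl.autoplay = true;
pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

// Data channel carries the Realtime API events.
const dc = pc.createDataChannel('oai-events');
dc.addEventListener('message', (e) => {
  const event = JSON.parse(e.data);
  // While the user talks, only input_audio_buffer.speech_started ever shows up here.
  console.log(event.type);
});

// Standard SDP offer/answer exchange with the Realtime endpoint.
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const res = await fetch(
  'https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17',
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${EPHEMERAL_KEY}`,
      'Content-Type': 'application/sdp',
    },
    body: offer.sdp,
  }
);
await pc.setRemoteDescription({ type: 'answer', sdp: await res.text() });
```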

This is the session creation code on my Node server, via which my React client instantiates the session:

```js
const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o-realtime-preview-2024-12-17',
    instructions: 'You are a friendly chap called bob',
    modalities: ['audio', 'text'],
    input_audio_format: 'pcm16', // the possible options here imply it should be re-encoded
    input_audio_transcription: {
      model: 'whisper-1',
    },
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 1000,
    },
    temperature: 0.8,
    voice: 'verse',
  }),
});
const session = await response.json(); // session.client_secret goes to the React client
```
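If re-encoding really is required, my assumption (untested) is that it only applies on the WebSocket transport, where you push audio yourself via input_audio_buffer.append rather than sending a WebRTC track. Something like this Float32-to-Int16 conversion is what I'd expect, but I'd love confirmation either way:

```js
// Assumption, not confirmed: manual PCM16 conversion of Web Audio API samples,
// for the WebSocket transport where audio is appended explicitly.
function floatTo16BitPCM(float32Samples) {
  const pcm16 = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm16;
}

// The bytes would then be base64-encoded and sent over the WebSocket, e.g.:
// ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Chunk }));
```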