Input_audio_transcription server events stopped occurring

I know that many people have asked about Realtime transcription on the forum, and I've worked through that issue in the past, but something new seems to be happening for me.

Previously, when using the Realtime beta, I had similar issues to others but was able to resolve them. The transcriptions were badly inaccurate and broken up, but it worked.

When the v1 full release came out, the transcriptions suddenly became much more accurate for me and came through more cleanly.

Recently, however, something has changed and I no longer get any conversation.item.input_audio_transcription.* events. (I log every event type that comes through, and none of those arrive.)

My server provides an ephemeral token like this, which works.

const sessionConfig = JSON.stringify({
    session: {
        type: "realtime",
        model: "gpt-realtime",
        audio: {
            output: { voice: "marin" },
        },
    },
});

const response = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
        "Authorization": `Bearer ${openaiApiKey}`,
        "Content-Type": "application/json",
    },
    body: sessionConfig,
});

My client establishes the WebRTC connection like this, then sends the session.update below after receiving a session.created event. That all works: audio conversation and response transcriptions all come through.

const url = `https://api.openai.com/v1/realtime/calls?model=${model}`;
const response = await fetch(url, {
    method: 'POST',
    headers: {
        'Content-Type': 'application/sdp',
        'Accept': 'application/sdp',
        'Authorization': `Bearer ${ephemeralToken}`
    },
    body: offer.sdp
});

const sessionUpdateEvent = {
    event_id: Crypto.randomUUID(),
    type: "session.update",
    session: {
        modalities: ["text", "audio"],
        instructions: this.socketConfig.prompt,
        input_audio_format: "pcm16",
        output_audio_format: "pcm16",
        // input_audio_transcription: { model: "gpt-4o-mini-transcribe", language: "en" },
        input_audio_transcription: { model: "whisper-1" },
        voice: this.sessionConfig?.voiceId || 'marin',
        turn_detection: turnDetection,
        temperature: 0.8,
        max_response_output_tokens: 4096
    },
};

this.sendJson(sessionUpdateEvent);


public sendJson(data: any): void {
    if (this.dataChannel?.readyState === 'open') {
        this.dataChannel.send(JSON.stringify(data));
    } else {
        console.error('Data channel is not open. Cannot send data.');
    }
}
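For reference, my event logging is roughly just a listener on that same data channel (simplified sketch; this is where the log further down comes from):

this.dataChannel?.addEventListener('message', (event: MessageEvent) => {
    const serverEvent = JSON.parse(event.data);
    // Log every server event type so nothing gets filtered out
    console.log(`📨 Event type: ${serverEvent.type}`);
    // Dump the full payload for conversation item events
    if (serverEvent.type.startsWith('conversation.item')) {
        console.log(JSON.stringify(serverEvent, null, 2));
    }
});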

And yet, for some reason, I just don't get any input transcription events any more.
Does anyone know what I'm doing wrong?

Here’s an example console log:

📨 Event type: input_audio_buffer.speech_started
📨 Event type: input_audio_buffer.speech_stopped
📨 Event type: input_audio_buffer.committed
📨 Event type: conversation.item.added
{
  "type": "conversation.item.added",
  "event_id": "event_CKmMAY5RZlCxH2el9gdsh",
  "previous_item_id": "item_CKmM63o6FnE6EXwMq1sqr",
  "item": {
    "id": "item_CKmM82dqcGP0Guh7Z20bQ",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
📨 Event type: conversation.item.done
{
  "type": "conversation.item.done",
  "event_id": "event_CKmMAvdOcHDEgPpmyZxdV",
  "previous_item_id": "item_CKmM63o6FnE6EXwMq1sqr",
  "item": {
    "id": "item_CKmM82dqcGP0Guh7Z20bQ",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
📨 Event type: response.created
📨 Event type: response.output_item.added
📨 Event type: conversation.item.added
{
  "type": "conversation.item.added",
  "event_id": "event_CKmMBbnC63FblNTAWc56W",
  "previous_item_id": "item_CKmM82dqcGP0Guh7Z20bQ",
  "item": {
    "id": "item_CKmMAiMqMQ7E6zHEZYDNE",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}
📨 Event type: response.content_part.added
📨 Event type: response.output_audio_transcript.delta
📨 Event type: output_audio_buffer.started
📨 Event type: response.output_audio_transcript.delta
📨 Event type: response.output_audio.done
📨 Event type: response.output_audio_transcript.done
📨 Event type: response.content_part.done
📨 Event type: conversation.item.done
{
  "type": "conversation.item.done",
  "event_id": "event_CKmMC05Kx2EWticHXiipV",
  "previous_item_id": "item_CKmM82dqcGP0Guh7Z20bQ",
  "item": {
    "id": "item_CKmMAiMqMQ7E6zHEZYDNE",
    "type": "message",
    "status": "completed",
    "role": "assistant",
    "content": [
      {
        "type": "output_audio",
        "transcript": "Hey there! Loud and clear, your test is coming through perfectly. How can I help you today?"
      }
    ]
  }
}
📨 Event type: response.output_item.done

As you can see above, I get audio buffer events, response events, and some conversation events, but no input transcription events.

There were a lot of changes with the v1 release and the removal of the beta flag. Here's what's working for me; note that the session (including the model) is specified differently now:

            "type": "session.update",
            "session": {
                "type": "realtime",
                "model": "gpt-realtime",
                "audio": {
                    "input": {
                        "format": {          
                            "type": "audio/pcm",
                            "rate": 24000,
                        },
                        "noise_reduction": {"type":"far_field"},
                        "transcription": {
                            "model": "gpt-4o-mini-transcribe" 
                            }...

With that in place, I get:

the assistant audio transcript in response.output_audio_transcript.done

and the user audio transcript in conversation.item.input_audio_transcription.completed
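For completeness, the whole event ends up looking roughly like this. The audio.input block is exactly what's shown above; the placement of instructions, turn_detection, and the output voice is my reading of the new session shape, so double-check those against the docs:

const sessionUpdateEvent = {
    type: "session.update",
    session: {
        type: "realtime",
        model: "gpt-realtime",
        // your prompt still goes at the top level of the session
        instructions: "You are a helpful assistant.",
        audio: {
            input: {
                format: { type: "audio/pcm", rate: 24000 },
                noise_reduction: { type: "far_field" },
                // input transcription now lives under audio.input
                transcription: { model: "gpt-4o-mini-transcribe" },
                // turn detection also appears to live under audio.input now
                turn_detection: { type: "server_vad" },
            },
            output: {
                // voice moved under audio.output
                voice: "marin",
            },
        },
    },
};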

Thank you! That was really helpful… I was able to understand the docs much more clearly as a result.
I did run into one issue (which I think I saw on another thread previously).

The transcriptions still didn’t come through until I added the transcription config settings into the “create ephemeral token” code on the server as well.
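Concretely, the token-minting code from my first post now includes the same audio.input.transcription block, roughly like this:

const sessionConfig = JSON.stringify({
    session: {
        type: "realtime",
        model: "gpt-realtime",
        audio: {
            input: {
                // without this, no conversation.item.input_audio_transcription.* events arrived
                transcription: { model: "gpt-4o-mini-transcribe" },
            },
            output: { voice: "marin" },
        },
    },
});

const response = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
        "Authorization": `Bearer ${openaiApiKey}`,
        "Content-Type": "application/json",
    },
    body: sessionConfig,
});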

For anyone else who lands here… I've found this to be true when adjusting the turn detection values as well. It only pays attention to the ones set on the server when creating the ephemeral token.
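For example, here's the relevant slice of that same server-side sessionConfig with turn detection added (the server_vad values are just placeholders, and the exact nesting under audio.input is my assumption from the new format):

session: {
    type: "realtime",
    model: "gpt-realtime",
    audio: {
        input: {
            transcription: { model: "gpt-4o-mini-transcribe" },
            // set at token-creation time; later session.update tweaks were ignored for me
            turn_detection: {
                type: "server_vad",
                threshold: 0.5,
                prefix_padding_ms: 300,
                silence_duration_ms: 500,
            },
        },
        output: { voice: "marin" },
    },
},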

I imagine the same is true of several other settings.