[Realtime API] Input audio transcription is not showing

Hi,

After WebSocket initialization I update the session and get this response:

{
    "type": "session.updated",
    "event_id": "xxx",
    "session": {
        "id": "xxx",
        "object": "realtime.session",
        "model": "gpt-4o-realtime-preview-2024-10-01",
        "expires_at": 1728374700,
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "...",
        "voice": "shimmer",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}
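
For reference, the session.update event I send looks roughly like this (a sketch; ws is my WebSocket and the instructions string is trimmed):

// Sketch of the session.update client event that produced the response above.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    modalities: ["text", "audio"],
    voice: "shimmer",
    instructions: "...",
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    turn_detection: { type: "server_vad" },
    input_audio_transcription: { model: "whisper-1" },
  },
}));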

and then I send my audio:

{
    'type': 'conversation.item.create',
    'item': {
        'type': 'message',
        'role': 'user',
        'content': [
            {
                'type': 'input_audio',
                'audio': audio_64
            }
        ]
    }
}
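
For context, audio_64 is just the raw PCM16 audio base64-encoded before sending, roughly like this (loadPcm16 and itemCreateEvent are placeholders for my own code):

// Sketch: base64-encode the raw PCM16 bytes, then send the event above.
// loadPcm16() stands in for whatever produces the raw audio.
const audio_64 = Buffer.from(loadPcm16()).toString("base64");
ws.send(JSON.stringify(itemCreateEvent)); // itemCreateEvent = the conversation.item.create payload above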

In the response I get everything except

conversation.item.input_audio_transcription

Can someone please help?


Same—I’m not seeing the audio transcription either.

Not only am I not seeing the transcription, but I also get an error event back from OpenAI saying something along the lines of:

Invalid parameter: session.input_audio_transcription.enabled does not exist.

I've both typed it in and copy-pasted it directly from the docs, but no luck…

As mentioned here, you can leave out the "enabled" key, which resolved it for some users. For me, however, this didn't work, but maybe you will have more luck. I really need the transcription as well.

To view the input audio transcript with the realtime model, you must first set input_audio_transcription in the session.update config:

  input_audio_transcription: {
    model: 'whisper-1',
  },

The input transcript is then available in the 'conversation.item.input_audio_transcription.completed' server event.

Not sure why this isn't mentioned in the Realtime API docs.
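
For anyone wiring this up over a raw WebSocket, the listener just matches that event type. A minimal sketch (assuming the Node ws client):

// Sketch: pull the user's transcript out of the server events.
ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // event.item_id / event.content_index say which item the transcript belongs to.
    console.log("user said:", event.transcript);
  } else if (event.type === "conversation.item.input_audio_transcription.failed") {
    console.error("transcription failed:", event.error);
  }
});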

I am also getting null in the user transcription:

{
  "type": "conversation.item.created",
  "event_id": "event_AkR2BLE7l9oMUumIva3Ku",
  "previous_item_id": null,
  "item": {
    "id": "item_AkR29UqpepukIR4ioIUYO",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
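
Worth noting: even when transcription is working, transcript is null at item-creation time, because Whisper runs asynchronously; the text only shows up later in the conversation.item.input_audio_transcription.completed (or .failed) event. A minimal sketch of patching it back onto the stored item, where itemsById is a hypothetical bit of local state:

// Sketch: the transcript arrives later, keyed by item_id, so patch it in
// when the completed event shows up. itemsById is hypothetical local state.
const itemsById = new Map<string, any>();

function handleServerEvent(event: any) {
  if (event.type === "conversation.item.created") {
    itemsById.set(event.item.id, event.item); // transcript is still null here
  } else if (event.type === "conversation.item.input_audio_transcription.completed") {
    const item = itemsById.get(event.item_id);
    if (item) item.content[event.content_index].transcript = event.transcript;
  }
}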

Same problem, "transcript" is empty (tested with WebRTC on 14/02/2025).
I have put this in the session creation:

"input_audio_transcription": {
  "model": "whisper-1",
  "language": "fr"
},

and in the "conversation.item.created" message I have:

"content": [
  {
    "type": "input_audio",
    "transcript": null
  }
]

Hey everyone. I have been using transcriptions extensively with the WebRTC approach and the peerConnection's data channel. I realize this is not WebSockets directly, but it is similar. You can see a working example of this in TypeScript/JavaScript in the demo from this repo: GitHub - activescott/typescript-openai-realtime-api: TypeScript OpenAI Realtime API Client & Examples

Click on the demo there, check “Transcribe User Audio”, and talk; you’ll see events come back with transcriptions.

A couple of things I have noticed along the way:

  1. Make sure to include the input_audio_transcription field in the session request you use to get the ephemeral token. If you don't set it there, you have to send a separate session.update client event to turn transcriptions on. This follow-up session.update event does work; I use it in the example, in fact (see the sketch after this list).
  2. Be careful about background noise. Sometimes the Realtime API will respond to speech, but the transcriptions are wrong or blank because the Whisper-1 model used for transcriptions doesn’t interpret the speech the same way that the Realtime API model does.
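
Here is roughly what that follow-up session.update looks like over the WebRTC data channel (a sketch; dataChannel is the RTCDataChannel opened on the peer connection):

// Sketch: turn on input transcription after the session is already running.
dataChannel.send(JSON.stringify({
  type: "session.update",
  session: {
    input_audio_transcription: { model: "whisper-1" },
  },
}));

// Transcripts then come back as events on the same channel.
dataChannel.addEventListener("message", (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log("transcription:", event.transcript);
  }
});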

I’ve only really tried this in English, although the transcriptions are really just from Whisper so anything that works with Whisper should work for transcriptions with the Realtime API too.

Hope this helps someone!

Scott