Missing input audio transcription

Currently, I send the audio input and receive back the audio and text output. However, I also want my audio input to be transcribed.

What I’m currently doing:

1. Connect through WebSocket and receive this back:

{
    "type": "session.created",
    "event_id": "event_BRsiKo95C4wtXJ9H37LvN",
    "session": {
        "id": "sess_BRsiKFh7sPhe51tcyYnUp",
        "object": "realtime.session",
        "expires_at": 1745986676,
        "input_audio_noise_reduction": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
            "create_response": true,
            "interrupt_response": true
        },
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "client_secret": null,
        "include": null,
        "model": "gpt-4o-realtime-preview",
        "modalities": [
            "audio",
            "text"
        ],
        "instructions": "[default instructions]",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}
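
For reference, here is what step 1 can look like in code. This is a hedged sketch assuming Node.js with the "ws" package and an OPENAI_API_KEY environment variable; the URL and headers follow the Realtime API docs for the beta.

import WebSocket from "ws";

// Connect to the Realtime API over WebSocket (Node.js + the "ws" package).
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1", // required while the API is in beta
    },
  }
);

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === "session.created") {
    console.log("session id:", event.session.id);
  }
});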

2. Set input_audio_transcription on the session and receive this back:

{
    "type": "session.updated",
    "event_id": "event_BRsiTisnXbF5OfXOC7R1s",
    "session": {
        "id": "sess_BRsiKFh7sPhe51tcyYnUp",
        "object": "realtime.session",
        "expires_at": 1745986676,
        "input_audio_noise_reduction": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
            "create_response": true,
            "interrupt_response": true
        },
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": null
        },
        "client_secret": null,
        "include": null,
        "model": "gpt-4o-realtime-preview",
        "modalities": [
            "audio",
            "text"
        ],
        "instructions": "[default instructions]",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}
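
The client event that triggers this confirmation is shown verbatim further down the thread; sent over the same socket as the sketch above, it would look like:

// Enable input audio transcription on the active session.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    input_audio_transcription: {
      model: "gpt-4o-transcribe",
      language: "en",
    },
  },
}));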

3. Create the conversation item by sending:

{
  "type" : "conversation.item.create",
  "item" : {
    "type" : "message",
    "role" : "user",
    "content" : [ {
      "type" : "input_audio",
      "audio" : "[audio data]"
    } ]
  }
}
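
Step 3 as code, a sketch assuming the audio is raw PCM16 (24 kHz, mono, 16-bit little-endian, which is what the session's pcm16 format expects) read from a hypothetical local file:

import { readFileSync } from "node:fs";

// Hypothetical input file: raw PCM16 audio, 24 kHz mono little-endian,
// matching the session's "input_audio_format": "pcm16".
const pcm = readFileSync("input.pcm");

ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [{ type: "input_audio", audio: pcm.toString("base64") }],
  },
}));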

4. Request a response using:

{
  "type" : "response.create",
  "response" : {
    "modalities" : [ "text", "audio" ]
  }
}
Then I receive the audio and text of the output.
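
Collecting that output looks roughly like this: the assistant audio arrives as base64 chunks in response.audio.delta events and the output transcript in response.audio_transcript.delta events, with response.done closing the turn (a sketch; event names per the Realtime API reference):

const audioChunks: Buffer[] = [];
let outputTranscript = "";

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "response.audio.delta":
      audioChunks.push(Buffer.from(event.delta, "base64")); // PCM16 chunk
      break;
    case "response.audio_transcript.delta":
      outputTranscript += event.delta; // transcript of the *output* audio
      break;
    case "response.done":
      console.log("assistant transcript:", outputTranscript);
      break;
  }
});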

However, I don’t see the transcription of the input anywhere. I searched through Postman and through the debugger and found nothing.

What am I missing?

It is strange, isn’t it? You’d want a text copy of both sides of the conversation, like ChatGPT keeps when using voice.

The Realtime API does not perform that additional transcription on the user input that is sent.

Here’s the client event that lets you programmatically retrieve the user audio that was used as input for a turn:

https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/retrieve

Send this event when you want to retrieve the server’s representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD.

Then, while the session is active, you can call this for individual turns to get the sliced-out audio the AI model actually used, and perform your own transcription with Whisper or the transcriptions endpoint.
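
A sketch of that workaround, under two assumptions: the retrieved audio comes back base64-encoded as raw PCM16 at 24 kHz mono (the session's format), and it needs a minimal WAV header before the transcriptions endpoint will accept it. The helper names are hypothetical; the endpoint and whisper-1 model are the standard audio API.

// Wrap raw PCM16 (24 kHz, mono, 16-bit LE) in a minimal WAV header so the
// /v1/audio/transcriptions endpoint can identify the format.
function pcm16ToWav(pcm: Buffer, sampleRate = 24000): Buffer {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write("WAVEfmt ", 8);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format: PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);     // sample rate
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

// Transcribe the base64 PCM16 audio retrieved from a conversation item.
async function transcribe(base64Audio: string): Promise<string> {
  const wav = pcm16ToWav(Buffer.from(base64Audio, "base64"));
  const form = new FormData();
  form.append("file", new Blob([wav], { type: "audio/wav" }), "turn.wav");
  form.append("model", "whisper-1");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  const json = await res.json();
  return json.text;
}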

I hope that gives you another avenue to pursue.

If you use your own voice detection and turn triggering, you’ll naturally have your own copy of the user audio you’re sending, which you can then process yourself in different ways.

Thank you.

  1. Since I already have the audio recording that I’m about to send, why would I need the retrieve event?
  2. What is the input_audio_transcription for?

I might be missing what is already there!

This is the type of object/event returned by the message retrieval just mentioned; the transcription options for the model would obviously need to be part of the session configuration:

{
    "event_id": "event_1920",
    "type": "conversation.item.created",
    "previous_item_id": "msg_002",
    "item": {
        "id": "msg_003",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "transcript": "hello how are you",
                "audio": "base64encodedaudio=="
            }
        ]
    }
}

If you can get the audio with conversation item retrieval, then it’s just a matter of getting that transcription fulfilled, and then retrieving the message again after whatever delay it takes for that separate internal transcription to complete the object’s fields.
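
If that’s how it behaves, a polling sketch would look like this: wait briefly, ask the server for its copy of the item, and read the transcript off the conversation.item.retrieved reply (the delay value is an arbitrary assumption):

// Ask the server for its copy of the item after a short, arbitrary delay,
// assuming the separate transcription pass needs time to run.
function retrieveItemLater(itemId: string, delayMs = 2000): void {
  setTimeout(() => {
    ws.send(JSON.stringify({ type: "conversation.item.retrieve", item_id: itemId }));
  }, delayMs);
}

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === "conversation.item.retrieved") {
    // May still be null if the transcription never ran or hasn't finished.
    console.log("transcript:", event.item.content[0]?.transcript);
  }
});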

Try that out!

Weirdly enough, I get null for the transcript.

This is how I update the session:

{
  "type" : "session.update",
  "session" : {
    "input_audio_transcription" : {
      "language" : "en",
      "model" : "gpt-4o-transcribe"
    }
  }
}

After which I receive the confirmation:

{
    "type": "session.updated",
    "event_id": "event_BRtpI1gNl1ZoakhAXm7b5",
    "session": {
        "id": "sess_BRtopWsFOrlMXaIpErLBv",
        "object": "realtime.session",
        "expires_at": 1745990923,
        "input_audio_noise_reduction": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
            "create_response": true,
            "interrupt_response": true
        },
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": null
        },
        "client_secret": null,
        "include": null,
        "model": "gpt-4o-realtime-preview",
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "...",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}

Then I send the audio and receive the id:

{
    "type": "conversation.item.created",
    "event_id": "event_BRtpXxr60fub8S4QpQiH0",
    "previous_item_id": null,
    "item": {
        "id": "item_BRtpXRq45bzG1XjoZ9kRQ",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "transcript": null
            }
        ]
    }
}

And finally, I retrieve:

{
    "type": "conversation.item.retrieve",
    "item_id": "item_BRtpXRq45bzG1XjoZ9kRQ"
}

And receive:

"content": [
            {
                "type": "input_audio",
                "transcript": null,
                "audio": "..."
                "format": "pcm16"
            }
        ]
    }
}

The transcript is null, so it looks like I have the transcription setting wrong somehow. Do you have any idea why?

OpenAI sends the user transcripts in separate events called conversation.item.input_audio_transcription.completed.

see: https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription
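
Per that reference, the input transcript arrives asynchronously as its own server event, keyed to the item it belongs to, rather than being filled in on the item synchronously. A listener sketch:

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  // The input transcript arrives as a separate asynchronous server event.
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log(`item ${event.item_id} [${event.content_index}]:`, event.transcript);
  }
});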

Hope that helps!

Unfortunately, my debugger never stops at that event (so I don’t receive it), and Postman doesn’t show any such event either.
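
One way to rule out a listener or tooling problem is to log every event type the server actually sends instead of breaking on a specific one. A diagnostic sketch; the failed event is the documented failure counterpart of the completed event:

// Log every server event type so nothing is missed by a selective breakpoint.
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  console.log("event:", event.type);
  if (event.type === "conversation.item.input_audio_transcription.failed") {
    console.error("transcription failed:", event.error);
  }
  if (event.type === "error") {
    console.error("server error:", event.error);
  }
});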
