Missing input audio transcription

Currently, I send the audio input and receive back the audio and text output. However, I also want my audio input to be transcribed.

What I’m currently doing:

1. Connect through WebSocket and receive this back:

{
    "type": "session.created",
    "event_id": "event_BRsiKo95C4wtXJ9H37LvN",
    "session": {
        "id": "sess_BRsiKFh7sPhe51tcyYnUp",
        "object": "realtime.session",
        "expires_at": 1745986676,
        "input_audio_noise_reduction": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
            "create_response": true,
            "interrupt_response": true
        },
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "client_secret": null,
        "include": null,
        "model": "gpt-4o-realtime-preview",
        "modalities": [
            "audio",
            "text"
        ],
        "instructions": "[default instructions]",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}
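
For reference, here is what step 1 can look like in code. This is a hedged sketch assuming Node.js with the "ws" package and an OPENAI_API_KEY environment variable; the URL and headers follow the Realtime API docs for the beta.

import WebSocket from "ws";

// Connect to the Realtime API over WebSocket (Node.js + the "ws" package).
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1", // required while the API is in beta
    },
  }
);

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === "session.created") {
    console.log("session id:", event.session.id);
  }
});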

2. Set input_audio_transcription on the session and receive this back:

{
    "type": "session.updated",
    "event_id": "event_BRsiTisnXbF5OfXOC7R1s",
    "session": {
        "id": "sess_BRsiKFh7sPhe51tcyYnUp",
        "object": "realtime.session",
        "expires_at": 1745986676,
        "input_audio_noise_reduction": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
            "create_response": true,
            "interrupt_response": true
        },
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": null
        },
        "client_secret": null,
        "include": null,
        "model": "gpt-4o-realtime-preview",
        "modalities": [
            "audio",
            "text"
        ],
        "instructions": "[default instructions]",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}
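
The client event that triggers this confirmation is shown verbatim further down the thread; sent over the same socket as the sketch above, it would look like:

// Enable input audio transcription on the active session.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    input_audio_transcription: {
      model: "gpt-4o-transcribe",
      language: "en",
    },
  },
}));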

3. Create the conversation item by sending:

{
  "type" : "conversation.item.create",
  "item" : {
    "type" : "message",
    "role" : "user",
    "content" : [ {
      "type" : "input_audio",
      "audio" : "[audio data]"
    } ]
  }
}
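
Step 3 as code, a sketch assuming the audio is raw PCM16 (24 kHz, mono, 16-bit little-endian, which is what the session's pcm16 format expects) read from a hypothetical local file:

import { readFileSync } from "node:fs";

// Hypothetical input file: raw PCM16 audio, 24 kHz mono little-endian,
// matching the session's "input_audio_format": "pcm16".
const pcm = readFileSync("input.pcm");

ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [{ type: "input_audio", audio: pcm.toString("base64") }],
  },
}));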

4. Request a response using:

{
  "type" : "response.create",
  "response" : {
    "modalities" : [ "text", "audio" ]
  }
}
Then I receive the audio and text of the output.
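
Collecting that output looks roughly like this: the assistant audio arrives as base64 chunks in response.audio.delta events and the output transcript in response.audio_transcript.delta events, with response.done closing the turn (a sketch; event names per the Realtime API reference):

const audioChunks: Buffer[] = [];
let outputTranscript = "";

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "response.audio.delta":
      audioChunks.push(Buffer.from(event.delta, "base64")); // PCM16 chunk
      break;
    case "response.audio_transcript.delta":
      outputTranscript += event.delta; // transcript of the *output* audio
      break;
    case "response.done":
      console.log("assistant transcript:", outputTranscript);
      break;
  }
});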

However, I don’t see the transcription of the input anywhere. I searched through Postman and through the debugger and found nothing.

What am I missing?

It is strange, isn’t it? You’d want a text copy of both sides of the conversation, like ChatGPT keeps when using voice.

The Realtime API does not perform that additional transcription on the user input that is sent.

Here’s the client event that lets you programmatically retrieve the user audio that was used as input for a turn:

https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/retrieve

Send this event when you want to retrieve the server’s representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD.

Then, while the session is active, you can call this for individual turns to get the sliced-out audio the AI model actually used, and perform your own transcription with Whisper or the transcriptions endpoint.
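
A sketch of that workaround, under two assumptions: the retrieved audio comes back base64-encoded as raw PCM16 at 24 kHz mono (the session's format), and it needs a minimal WAV header before the transcriptions endpoint will accept it. The helper names are hypothetical; the endpoint and whisper-1 model are the standard audio API.

// Wrap raw PCM16 (24 kHz, mono, 16-bit LE) in a minimal WAV header so the
// /v1/audio/transcriptions endpoint can identify the format.
function pcm16ToWav(pcm: Buffer, sampleRate = 24000): Buffer {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write("WAVEfmt ", 8);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format: PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);     // sample rate
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

// Transcribe the base64 PCM16 audio retrieved from a conversation item.
async function transcribe(base64Audio: string): Promise<string> {
  const wav = pcm16ToWav(Buffer.from(base64Audio, "base64"));
  const form = new FormData();
  form.append("file", new Blob([wav], { type: "audio/wav" }), "turn.wav");
  form.append("model", "whisper-1");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  const json = await res.json();
  return json.text;
}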

I hope that gives you another avenue to pursue.

If you use your own voice detection and turn triggering, you’ll naturally have your own copy of the user audio you’re sending, which you can then process yourself in different ways.

Thank you.

  1. Since I already have the audio recording that I’m about to send, why would I need the retrieve event?
  2. What is the input_audio_transcription for?

I might be missing what is already there!

This is the type of object/event returned by the message retrieval just mentioned; the transcription options for the model would obviously need to be part of the session configuration:

{
    "event_id": "event_1920",
    "type": "conversation.item.created",
    "previous_item_id": "msg_002",
    "item": {
        "id": "msg_003",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "transcript": "hello how are you",
                "audio": "base64encodedaudio=="
            }
        ]
    }
}

If you can get the audio with conversation item retrieval, then it’s just a matter of getting that transcription fulfilled, and then retrieving the message again after whatever delay it takes for that separate internal transcription to complete the object’s fields.
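
If that’s how it behaves, a polling sketch would look like this: wait briefly, ask the server for its copy of the item, and read the transcript off the conversation.item.retrieved reply (the delay value is an arbitrary assumption):

// Ask the server for its copy of the item after a short, arbitrary delay,
// assuming the separate transcription pass needs time to run.
function retrieveItemLater(itemId: string, delayMs = 2000): void {
  setTimeout(() => {
    ws.send(JSON.stringify({ type: "conversation.item.retrieve", item_id: itemId }));
  }, delayMs);
}

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === "conversation.item.retrieved") {
    // May still be null if the transcription never ran or hasn't finished.
    console.log("transcript:", event.item.content[0]?.transcript);
  }
});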

Try that out!

Weirdly enough, I get null for the transcript.

This is how I update the session:

{
  "type" : "session.update",
  "session" : {
    "input_audio_transcription" : {
      "language" : "en",
      "model" : "gpt-4o-transcribe"
    }
  }
}

After which I receive the confirmation:

{
    "type": "session.updated",
    "event_id": "event_BRtpI1gNl1ZoakhAXm7b5",
    "session": {
        "id": "sess_BRtopWsFOrlMXaIpErLBv",
        "object": "realtime.session",
        "expires_at": 1745990923,
        "input_audio_noise_reduction": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
            "create_response": true,
            "interrupt_response": true
        },
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": null
        },
        "client_secret": null,
        "include": null,
        "model": "gpt-4o-realtime-preview",
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "...",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}

Then I send the audio and receive the id:

{
    "type": "conversation.item.created",
    "event_id": "event_BRtpXxr60fub8S4QpQiH0",
    "previous_item_id": null,
    "item": {
        "id": "item_BRtpXRq45bzG1XjoZ9kRQ",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "transcript": null
            }
        ]
    }
}

And finally, I retrieve:

{
    "type": "conversation.item.retrieve",
    "item_id": "item_BRtpXRq45bzG1XjoZ9kRQ"
}

And receive:

"content": [
            {
                "type": "input_audio",
                "transcript": null,
                "audio": "..."
                "format": "pcm16"
            }
        ]
    }
}

The transcript is null, so it looks like I have the transcription setting wrong somehow. Do you have any idea why?

OpenAI sends the user transcripts in separate events called conversation.item.input_audio_transcription.completed.

see: https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription
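
Per that reference, the input transcript arrives asynchronously as its own server event, keyed to the item it belongs to, rather than being filled in on the item synchronously. A listener sketch:

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  // The input transcript arrives as a separate asynchronous server event.
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log(`item ${event.item_id} [${event.content_index}]:`, event.transcript);
  }
});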

Hope that helps!

Unfortunately, my debugger never stops at that event (so I don’t receive it), and Postman doesn’t show any such event either.
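
One way to rule out a listener or tooling problem is to log every event type the server actually sends instead of breaking on a specific one. A diagnostic sketch; the failed event is the documented failure counterpart of the completed event:

// Log every server event type so nothing is missed by a selective breakpoint.
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  console.log("event:", event.type);
  if (event.type === "conversation.item.input_audio_transcription.failed") {
    console.error("transcription failed:", event.error);
  }
  if (event.type === "error") {
    console.error("server error:", event.error);
  }
});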
