[Realtime API] Input audio transcription is not showing

Hi,

After WebSocket initialization I update the session and get this response:

{
    "type": "session.updated",
    "event_id": "xxx",
    "session": {
        "id": "xxx",
        "object": "realtime.session",
        "model": "gpt-4o-realtime-preview-2024-10-01",
        "expires_at": 1728374700,
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "...",
        "voice": "shimmer",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "tools": []
    }
}
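
For reference, the session.update event I send looks roughly like this (a sketch; ws is my WebSocket and the instructions string is trimmed):

// Sketch of the session.update client event that produced the response above.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    modalities: ["text", "audio"],
    voice: "shimmer",
    instructions: "...",
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    turn_detection: { type: "server_vad" },
    input_audio_transcription: { model: "whisper-1" },
  },
}));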

and then I send my audio:

{
    'type': 'conversation.item.create',
    'item': {
        'type': 'message',
        'role': 'user',
        'content': [
            {
                'type': 'input_audio',
                'audio': audio_64
            }
        ]
    }
}
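
For context, audio_64 is just the raw PCM16 audio base64-encoded before sending, roughly like this (loadPcm16 and itemCreateEvent are placeholders for my own code):

// Sketch: base64-encode the raw PCM16 bytes, then send the event above.
// loadPcm16() stands in for whatever produces the raw audio.
const audio_64 = Buffer.from(loadPcm16()).toString("base64");
ws.send(JSON.stringify(itemCreateEvent)); // itemCreateEvent = the conversation.item.create payload above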

In the response I get everything except

conversation.item.input_audio_transcription

Can someone please help?


Same—I’m not seeing the audio transcription either.

Not only am I not seeing the transcription, but I also get an error event back from OpenAI saying something along the lines of:

Invalid parameter: session.input_audio_transcription.enabled does not exist.

I've both typed it in and copy-pasted it directly from the docs, but no luck…

As mentioned here, you can leave out the "enabled" key, which resolved it for some users. For me, however, this didn't work, but maybe you will have more luck. I really need the transcription as well.

To view the input audio transcript with the realtime model, you must first set input_audio_transcription in the session.update config:

  input_audio_transcription: {
    model: 'whisper-1',
  },

The input transcript is then available in the 'conversation.item.input_audio_transcription.completed' server event.

Not sure why this isn't mentioned in the Realtime API docs.
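
For anyone wiring this up over a raw WebSocket, the listener just matches that event type. A minimal sketch (assuming the Node ws client):

// Sketch: pull the user's transcript out of the server events.
ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // event.item_id / event.content_index say which item the transcript belongs to.
    console.log("user said:", event.transcript);
  } else if (event.type === "conversation.item.input_audio_transcription.failed") {
    console.error("transcription failed:", event.error);
  }
});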

I am also getting null in the user transcription:

{
  "type": "conversation.item.created",
  "event_id": "event_AkR2BLE7l9oMUumIva3Ku",
  "previous_item_id": null,
  "item": {
    "id": "item_AkR29UqpepukIR4ioIUYO",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
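
Worth noting: even when transcription is working, transcript is null at item-creation time, because Whisper runs asynchronously; the text only shows up later in the conversation.item.input_audio_transcription.completed (or .failed) event. A minimal sketch of patching it back onto the stored item, where itemsById is a hypothetical bit of local state:

// Sketch: the transcript arrives later, keyed by item_id, so patch it in
// when the completed event shows up. itemsById is hypothetical local state.
const itemsById = new Map<string, any>();

function handleServerEvent(event: any) {
  if (event.type === "conversation.item.created") {
    itemsById.set(event.item.id, event.item); // transcript is still null here
  } else if (event.type === "conversation.item.input_audio_transcription.completed") {
    const item = itemsById.get(event.item_id);
    if (item) item.content[event.content_index].transcript = event.transcript;
  }
}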

Same problem, "transcript" is empty (tested with WebRTC on 14/02/2025).
I have put this in the session creation:

"input_audio_transcription": {
  "model": "whisper-1",
  "language": "fr"
},

and in the "conversation.item.created" message I have:

"content": [
  {
    "type": "input_audio",
    "transcript": null
  }
]

Hey everyone. I have been using transcriptions extensively with the WebRTC approach and the peerConnection's data channel. I realize this is not WebSockets directly, but it is similar. You can see a working example of this in TypeScript/JavaScript in the demo from this repo: GitHub - activescott/typescript-openai-realtime-api: TypeScript OpenAI Realtime API Client & Examples

Click on the demo there, check “Transcribe User Audio”, and talk; you’ll see events come back with transcriptions.

A couple of things I have noticed along the way:

  1. Make sure to include the input_audio_transcription field in the session request you use to get the ephemeral token. If you don't set it there, you have to send a separate session.update client event to turn transcriptions on. This follow-up session.update event does work; I use it in the example, in fact (see the sketch after this list).
  2. Be careful about background noise. Sometimes the Realtime API will respond to speech, but the transcriptions are wrong or blank because the Whisper-1 model used for transcriptions doesn’t interpret the speech the same way that the Realtime API model does.
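
Here is roughly what that follow-up session.update looks like over the WebRTC data channel (a sketch; dataChannel is the RTCDataChannel opened on the peer connection):

// Sketch: turn on input transcription after the session is already running.
dataChannel.send(JSON.stringify({
  type: "session.update",
  session: {
    input_audio_transcription: { model: "whisper-1" },
  },
}));

// Transcripts then come back as events on the same channel.
dataChannel.addEventListener("message", (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log("transcription:", event.transcript);
  }
});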

I’ve only really tried this in English, although the transcriptions are really just from Whisper so anything that works with Whisper should work for transcriptions with the Realtime API too.

Hope this helps someone!

Scott