Can't get the user transcription in the Realtime API

Hi everyone, I am implementing the OpenAI Realtime API and have configured the session to include audio transcription using the following configuration:

input_audio_transcription: {
    model: "whisper-1"
}

However, the audio input provided by the user does not generate a transcript. Instead, the transcript field always returns null. Below is the response received from the API:

{
  "type": "conversation.item.created",
  "event_id": "event_AkR2BLE7l9oMUumIva3Ku",
  "previous_item_id": null,
  "item": {
    "id": "item_AkR29UqpepukIR4ioIUYO",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}

So how can I get the user transcript from the Realtime API?

Can someone please help?


Have you solved this yet?

You need to add it to your session.update event to retrieve the transcription; it isn't included by default. Here's an example:

/*****************************************
 *   CONFIGURE DATA FOR DATA CHANNEL     *
 *****************************************/
function configureData() {
  const event = {
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      tools: [
        { type: 'function', name: 'functionOne', description: 'Function one description' },
        { type: 'function', name: 'functionTwo', description: 'Function two description' },
        { type: 'function', name: 'functionThree', description: 'Function three description' },
        { type: 'function', name: 'functionFour', description: 'Function four description' },
        { type: 'function', name: 'functionFive', description: 'Handles text from AI response' },
      ],
      input_audio_transcription: {
        model: 'whisper-1',
      },
    },
  };

  if (dataChannel && dataChannel.readyState === 'open') {
    dataChannel.send(JSON.stringify(event));
    console.log('Session update sent.');
  }
}

NOTE: You don't need the functions; however, this shows how you would include them.

Also, you need to pull the Assistant and User audio/text from the logs and display them in your UI if you want them visually logged for the user.
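
For example, here is a minimal sketch of that, assuming a WebRTC dataChannel like the one above and a hypothetical appendLine(role, text) helper that writes into your UI; the event types are the Realtime API server events for the assistant and user transcripts:

// Sketch: pull assistant and user transcripts off the data channel
// and hand them to the UI. appendLine is a hypothetical UI helper.
dataChannel.addEventListener('message', (e) => {
  const event = JSON.parse(e.data);

  // Assistant's spoken text, available once its audio transcript is done.
  if (event.type === 'response.audio_transcript.done') {
    appendLine('assistant', event.transcript);
  }

  // User's spoken text, available once input audio transcription completes.
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    appendLine('user', event.transcript);
  }
});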

I don't understand. I have created the session with the right message.
After that I receive the "type": "session.created" message with:

"input_audio_transcription": {
  "model": "whisper-1",
  "language": "fr",
  "prompt": null
},

But in the "type": "conversation.item.created" message I have:

"role": "user",
"content": [
  {
    "type": "input_audio",
    "transcript": null
  }
]
The right event handler to get the user transcription is this:

conversation.item.input_audio_transcription.completed
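
For example, a minimal sketch, assuming a Node.js client connected over a WebSocket with the ws package (the variable names are mine, not from the docs):

// Log the user's transcript once Whisper finishes transcribing the input.
// Assumes `ws` is an already-open WebSocket to the Realtime API.
ws.on('message', (raw) => {
  const event = JSON.parse(raw);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    console.log('User said:', event.transcript);
  }
});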

Also, this topic could be relevant:

Hi everyone,

I just got through this error, and I want to share how to fix it in case you're getting null transcriptions as well. Long story short, I was doing the initial POST handshake for the ephemeral token with an empty body, and also trying to set the configuration over WebSockets and all kinds of brute force, until I discovered the answer.

{
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "prompt": "",
    "language": "en"
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  },
  "input_audio_noise_reduction": {
    "type": "near_field"
  },
  "include": [
    "item.input_audio_transcription.logprobs"
  ]
}

Before you initiate the realtime transcription session, when you trigger the initial POST request to get your ephemeral token, you need to set the request body to the configuration of which model to use, etc., as sketched below.

If you send an empty body with that POST, handing over the API key just to get the ephemeral token, you will get this exact issue of receiving "null" transcriptions, and potentially lose a day or two and give up if you don't have the will of the Highlander.
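
Here is a rough sketch of what that POST can look like, using the configuration above. The transcription_sessions endpoint name and the client_secret.value field are assumptions based on my setup, so check the current docs for your account:

// Mint an ephemeral token for a transcription session, passing the
// transcription configuration in the POST body instead of an empty body.
async function createTranscriptionSession(apiKey) {
  const response = await fetch('https://api.openai.com/v1/realtime/transcription_sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      input_audio_format: 'pcm16',
      input_audio_transcription: {
        model: 'gpt-4o-transcribe',
        prompt: '',
        language: 'en',
      },
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,
      },
      input_audio_noise_reduction: { type: 'near_field' },
      include: ['item.input_audio_transcription.logprobs'],
    }),
  });
  const session = await response.json();
  // The ephemeral token is expected under client_secret.value (assumption).
  return session.client_secret.value;
}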

The API documentation is a total disaster btw

I’ve moved on and found the solution, may others also reach the same blessing by reading this. Aloha.


You need to ensure that input_audio_transcription is properly configured in the session settings and that the audio input is valid. Try this configuration:

{
  "input_audio_transcription": {
    "enabled": true,
    "model": "whisper-1"
  }
}

Also, ensure your request includes valid audio data in a format the Realtime API accepts (e.g. pcm16, g711_ulaw, or g711_alaw).

To receive real-time transcriptions using conversation.item.input_audio_transcription.delta, ensure your OpenAI session is updated.
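
If you want the streaming (partial) transcript rather than waiting for the completed event, a small sketch of accumulating the deltas could look like this, assuming your event handler already parses each server event into an object:

// Accumulate streaming user-transcript deltas per conversation item.
const partialTranscripts = {};

function handleTranscriptionEvent(event) {
  if (event.type === 'conversation.item.input_audio_transcription.delta') {
    partialTranscripts[event.item_id] =
      (partialTranscripts[event.item_id] || '') + event.delta;
    console.log('partial user transcript:', partialTranscripts[event.item_id]);
  }
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    // The completed event carries the full transcript.
    console.log('final user transcript:', event.transcript);
    delete partialTranscripts[event.item_id];
  }
}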

I am also really struggling to get user input audio transcription.

As suggested in the forums, I have updated my session to configure input_audio_transcription. I am fairly sure that my audio input is valid (any ideas how to confirm that?) because I am getting valid audio responses back.

However, I still see the transcript in the conversation.item.created.content as null and I do not receive any conversation.item.input_audio_transcription.completed message from the server.

Please could somebody help? Been working on this for 5 hours and am completely stuck!

Below is the session.updated message I receive:

Session Updated: {
  type: 'session.updated',
  event_id: 'event_BWNN4Ka6Hsm9UCmnulpNb',
  session: {
    id: 'sess_BWNN4fcr1ZIsimrOsQM81',
    object: 'realtime.session',
    expires_at: 1747057834,
    input_audio_noise_reduction: null,
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 1000,
      create_response: true,
      interrupt_response: true
    },
    input_audio_format: 'pcm16',
    input_audio_transcription: { model: 'whisper-1', language: 'en', prompt: null },
    client_secret: null,
    include: null,
    model: 'gpt-4o-realtime-preview-2024-12-17',
    modalities: [ 'text', 'audio' ],
    instructions: 'prompt',
    voice: 'alloy',
    output_audio_format: 'pcm16',
    tool_choice: 'auto',
    temperature: 0.8,
    max_response_output_tokens: 4000,
    tools: []
  }
}