Can't get the user transcript in the Realtime API

Hi everyone, I am implementing the OpenAI Realtime API and have configured the session to include audio transcription using the following configuration:

input_audio_transcription: {
    model: "whisper-1"
}

However, the audio input provided by the user does not generate a transcript. Instead, the transcript field always returns null. Below is the response received from the API:

{
  "type": "conversation.item.created",
  "event_id": "event_AkR2BLE7l9oMUumIva3Ku",
  "previous_item_id": null,
  "item": {
    "id": "item_AkR29UqpepukIR4ioIUYO",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}

So how can I get the user transcript from the Realtime API?

Can someone please help?

Have you solved this yet?

You need to add input_audio_transcription to the session object in your session.update event to get transcripts back. By default, it isn't included. Here's an example:

/*****************************************
 *   CONFIGURE DATA FOR DATA CHANNEL     *
 *****************************************/
function configureData() {
  const event = {
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      tools: [
        { type: 'function', name: 'functionOne', description: 'Function one description' },
        { type: 'function', name: 'functionTwo', description: 'Function two description' },
        { type: 'function', name: 'functionThree', description: 'Function three description' },
        {
          type: 'function',
          name: 'functionFour',
          description: 'Function four description',
        },
        {
          type: 'function',
          name: 'functionFive',
          description: 'Handles text from AI response',
        },
      ],
      // Turns on transcription of the user's input audio
      input_audio_transcription: {
        model: 'whisper-1',
      },
    },
  };

  // dataChannel is assumed to be the data channel you created elsewhere
  if (dataChannel && dataChannel.readyState === 'open') {
    dataChannel.send(JSON.stringify(event));
    console.log('Session update sent.');
  }
}

NOTE: You don't need the functions; they are only there to show how you would include them.

Also, nothing displays the transcripts for you. You need to pull the Assistant and User audio/text out of the events you are logging and display them in your UI if you want the user to see them.
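To make that last point more concrete, here is a minimal sketch of pulling the transcripts out of the incoming events. It assumes the beta event names as I understand them: the user transcript arrives in a separate conversation.item.input_audio_transcription.completed event (which would explain why the transcript in conversation.item.created is still null), and the assistant transcript arrives in response.audio_transcript.done. Please check these names against the current docs, since they may differ between API versions. createTranscriptLog and handleServerEvent are hypothetical helper names, not part of any SDK.

```javascript
// Sketch only: collects user and assistant transcripts from Realtime
// server events. The event names are assumptions based on the beta
// Realtime API; verify them against the version you are using.
function createTranscriptLog() {
  const entries = [];

  // rawMessage is the JSON string received on the data channel (or WebSocket).
  function handleServerEvent(rawMessage) {
    const event = JSON.parse(rawMessage);

    switch (event.type) {
      // User speech: transcription finishes after the item has been created.
      case 'conversation.item.input_audio_transcription.completed':
        entries.push({ role: 'user', itemId: event.item_id, text: event.transcript });
        break;

      // Assistant speech: final transcript of the generated audio.
      case 'response.audio_transcript.done':
        entries.push({ role: 'assistant', itemId: event.item_id, text: event.transcript });
        break;

      default:
        // Every other event type is ignored in this sketch.
        break;
    }

    return entries;
  }

  return { entries, handleServerEvent };
}

// Hypothetical wiring, assuming dataChannel is your open data channel:
// const log = createTranscriptLog();
// dataChannel.addEventListener('message', (e) => {
//   log.handleServerEvent(e.data);
//   // render log.entries in your UI here
// });
```

Keeping the item id with each entry lets you match a transcript to the conversation item it belongs to, since the transcript can arrive after other events for the same turn.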