I’m trying to have a conversation with the Realtime API. I record a message and send it, but I also want my audio message to show up as a message I sent; in other words, I want my recording transcribed.
What I’ve done so far:
- Establish a WebSocket connection. I receive this in response:
{
"type": "session.created",
"event_id": "event_BTepLqYzeQG3H50OX80NX",
"session": {
"id": "sess_BTepLv2jIuFPxvBlhamPW",
"object": "realtime.session",
"expires_at": 1746409951,
"input_audio_noise_reduction": null,
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 200,
"create_response": true,
"interrupt_response": true
},
"input_audio_format": "pcm16",
"input_audio_transcription": null,
"client_secret": null,
"include": null,
"model": "gpt-4o-realtime-preview",
"modalities": [
"text",
"audio"
],
"instructions": "[redacted]",
"voice": "alloy",
"output_audio_format": "pcm16",
"tool_choice": "auto",
"temperature": 0.8,
"max_response_output_tokens": "inf",
"tools": []
}
}
input_audio_transcription is null, and I want to fix that, so I update the session using this:
{
"type" : "session.update",
"session" : {
"input_audio_transcription" : {
"language" : "en",
"model" : "whisper-1",
"prompt" : "Use a British accent."
}
}
}
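For reference, the same update can be built programmatically. This is only a minimal Python sketch: the helper name build_session_update is hypothetical, and the field values are the ones from the event above.

```python
import json


def build_session_update(model: str = "whisper-1",
                         language: str = "en",
                         prompt: str = "") -> str:
    """Serialize a session.update event that enables input audio transcription."""
    transcription = {"model": model, "language": language}
    if prompt:
        # The prompt is optional, so it is only included when provided.
        transcription["prompt"] = prompt
    event = {
        "type": "session.update",
        "session": {"input_audio_transcription": transcription},
    }
    return json.dumps(event)


# The resulting string is what would be sent over the open WebSocket.
payload = build_session_update(prompt="Use a British accent.")
```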
- I receive a response back:
{
"type": "session.updated",
"event_id": "event_BTepQRdq8siJYUlUsSyL8",
"session": {
"id": "sess_BTepLv2jIuFPxvBlhamPW",
"object": "realtime.session",
"expires_at": 1746409951,
"input_audio_noise_reduction": null,
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 200,
"create_response": true,
"interrupt_response": true
},
"input_audio_format": "pcm16",
"input_audio_transcription": {
"model": "whisper-1",
"language": "en",
"prompt": "Use a British accent."
},
"client_secret": null,
"include": null,
"model": "gpt-4o-realtime-preview",
"modalities": [
"text",
"audio"
],
"instructions": "[redacted]",
"voice": "alloy",
"output_audio_format": "pcm16",
"tool_choice": "auto",
"temperature": 0.8,
"max_response_output_tokens": "inf",
"tools": []
}
}
- I send the audio with this event:
{
"type" : "conversation.item.create",
"event_id" : "1746408181",
"item" : {
"type" : "message",
"role" : "user",
"content" : [ {
"type" : "input_audio",
"audio" : "UklGRjy+AgBXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YRi+AgD3//j..."
} ]
}
}
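A sketch of how such an event could be assembled from recorded bytes, assuming the audio is already available in memory. The helper name build_audio_item is hypothetical; the event shape mirrors the one above.

```python
import base64
import json
from typing import Optional


def build_audio_item(audio_bytes: bytes, event_id: Optional[str] = None) -> str:
    """Serialize a conversation.item.create event carrying base64-encoded user audio."""
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    # The audio field is a base64 string, not raw bytes.
                    "audio": base64.b64encode(audio_bytes).decode("ascii"),
                }
            ],
        },
    }
    if event_id is not None:
        event["event_id"] = event_id
    return json.dumps(event)
```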
- The item is created:
{
"type": "conversation.item.created",
"event_id": "event_BTepwDIFQNG6ouL9weJro",
"previous_item_id": null,
"item": {
"id": "item_BTepwDRLv35FqFspbIa27",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "user",
"content": [
{
"type": "input_audio",
"transcript": null
}
]
}
}
- I want to have a transcript, so I call retrieve with the item_id from the previous response and receive this:
{
"type": "conversation.item.retrieved",
"event_id": "event_BTercp18HeG8nWHDcYeiF",
"item": {
"id": "item_BTepwDRLv35FqFspbIa27",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "user",
"content": [
{
"type": "input_audio",
"transcript": null,
"audio": "UklGRjy+AgBXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YRi+AgD3...",
"format": "pcm16"
}
]
}
}
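The retrieve request itself is not shown above. A minimal sketch of it, assuming it is the conversation.item.retrieve client event with the item_id taken from conversation.item.created (the helper name build_item_retrieve is hypothetical):

```python
import json


def build_item_retrieve(item_id: str) -> str:
    """Serialize a conversation.item.retrieve event for one conversation item."""
    return json.dumps({"type": "conversation.item.retrieve", "item_id": item_id})


request = build_item_retrieve("item_BTepwDRLv35FqFspbIa27")
```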
The transcript is null after both the conversation.item.created and conversation.item.retrieved events. I expected the transcript field to contain the transcription of the audio I sent.
Am I misunderstanding? How do I retrieve the audio transcript?
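One thing that may be worth checking: the Realtime API reference lists a separate server event, conversation.item.input_audio_transcription.completed, which carries the transcript along with the item_id instead of filling it in on the item events. The sketch below collects those transcripts by item_id. The helper name record_transcript is hypothetical, and whether this event is emitted for audio added through conversation.item.create (rather than through the input audio buffer) is an assumption that is not confirmed here.

```python
import json
from typing import Dict, Optional

COMPLETED = "conversation.item.input_audio_transcription.completed"


def record_transcript(raw_message: str, transcripts: Dict[str, str]) -> Optional[str]:
    """Store the transcript from a completed-transcription event, keyed by item_id.

    Returns the item_id when a transcript was stored, otherwise None.
    """
    event = json.loads(raw_message)
    if event.get("type") != COMPLETED:
        # Every other server event is ignored by this helper.
        return None
    transcripts[event["item_id"]] = event["transcript"]
    return event["item_id"]
```

Calling record_transcript on every message received from the WebSocket would then fill the dictionary as transcripts arrive, so the user's message can be rendered once its item_id appears there.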