Currently, I send the audio input and receive back the audio and text output. However, I also want my audio input to be transcribed.
What I’m currently doing:
1. Connect through WebSocket and receive this back:
{
"type": "session.created",
"event_id": "event_BRsiKo95C4wtXJ9H37LvN",
"session": {
"id": "sess_BRsiKFh7sPhe51tcyYnUp",
"object": "realtime.session",
"expires_at": 1745986676,
"input_audio_noise_reduction": null,
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 200,
"create_response": true,
"interrupt_response": true
},
"input_audio_format": "pcm16",
"input_audio_transcription": null,
"client_secret": null,
"include": null,
"model": "gpt-4o-realtime-preview",
"modalities": [
"audio",
"text"
],
"instructions": "[default instructions]",
"voice": "alloy",
"output_audio_format": "pcm16",
"tool_choice": "auto",
"temperature": 0.8,
"max_response_output_tokens": "inf",
"tools": []
}
}
2. Set input_audio_transcription and receive this back:
{
"type": "session.updated",
"event_id": "event_BRsiTisnXbF5OfXOC7R1s",
"session": {
"id": "sess_BRsiKFh7sPhe51tcyYnUp",
"object": "realtime.session",
"expires_at": 1745986676,
"input_audio_noise_reduction": null,
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 200,
"create_response": true,
"interrupt_response": true
},
"input_audio_format": "pcm16",
"input_audio_transcription": {
"model": "gpt-4o-transcribe",
"language": "en",
"prompt": null
},
"client_secret": null,
"include": null,
"model": "gpt-4o-realtime-preview",
"modalities": [
"audio",
"text"
],
"instructions": "[default instructions]",
"voice": "alloy",
"output_audio_format": "pcm16",
"tool_choice": "auto",
"temperature": 0.8,
"max_response_output_tokens": "inf",
"tools": []
}
}
3. Create the conversation item by sending this:
{
"type" : "conversation.item.create",
"item" : {
"type" : "message",
"role" : "user",
"content" : [ {
"type" : "input_audio",
"audio" : "[audio data]"
} ]
}
}
4. Request a response using:
{
"type" : "response.create",
"response" : {
"modalities" : [ "text", "audio" ]
}
}
- I receive the audio and text of the output
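For reference, here is a minimal Python sketch of the client events behind steps 2 to 4. The function names are hypothetical, and the `session.update` shape is an assumption that mirrors the `session.updated` payload shown above. It only builds the JSON strings; sending them over the socket is left out.

```python
import base64
import json


def session_update_event(model="gpt-4o-transcribe", language="en"):
    # Step 2: ask the session to transcribe input audio.
    # Assumed shape, mirroring the "session.updated" payload above.
    return {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": model, "language": language}
        },
    }


def user_audio_item_event(pcm16_bytes):
    # Step 3: wrap raw PCM16 audio as base64 inside a user message item.
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
                }
            ],
        },
    }


def response_create_event():
    # Step 4: request a response with both text and audio output.
    return {"type": "response.create", "response": {"modalities": ["text", "audio"]}}


def build_client_events(pcm16_bytes):
    # Serialize the three events in the order they are sent.
    events = (
        session_update_event(),
        user_audio_item_event(pcm16_bytes),
        response_create_event(),
    )
    return [json.dumps(event) for event in events]
```

Each string returned by `build_client_events` would be sent as one WebSocket text message, in order.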
However, I don’t see the transcription of the input anywhere. I searched through Postman and through the debugger and found nothing.
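To make the search repeatable, this is a small Python sketch that scans logged WebSocket messages for transcription events. It assumes the transcript would arrive as a separate server event whose `type` contains `input_audio_transcription` (for example `conversation.item.input_audio_transcription.completed`, as I understand the Realtime API docs). The helper name is hypothetical.

```python
import json


def find_transcription_events(raw_messages):
    # Return every parsed event whose "type" mentions input audio transcription.
    hits = []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(event, dict):
            continue
        # Assumption: transcription events carry this substring in their type.
        if "input_audio_transcription" in str(event.get("type", "")):
            hits.append(event)
    return hits
```

An empty result on the full message log would confirm that no such event was received at all, rather than just overlooked.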
What am I missing?