Hi,
I’m trying to use the new STT model. I’m sending the following frame to the “wss://api.openai.com/v1/realtime?intent=transcription” WebSocket endpoint:
{
  "type": "transcription_session.update",
  "include": [
    "item.input_audio_transcription.logprobs"
  ],
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "prompt": "",
    "language": "",
    "model": "gpt-4o-mini-transcribe"
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500,
    "create_response": true
  },
  "input_audio_noise_reduction": {
    "type": "near_field"
  }
}
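For completeness, this is roughly how I open the socket and send that frame (a minimal Python sketch using the websockets package; the header names are the ones from the Realtime docs, and I've trimmed the payload here since the full frame is shown above):

import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?intent=transcription"

# Trimmed copy of the frame above; in practice I send the full payload.
UPDATE_FRAME = {
    "type": "transcription_session.update",
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
        "prompt": "",
        "language": "",
        "model": "gpt-4o-mini-transcribe",
    },
}

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # NB: older websockets versions call this kwarg extra_headers instead
    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(json.dumps(UPDATE_FRAME))
        print(await ws.recv())  # prints the transcription_session.created event below

asyncio.run(main())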
but I get the response:
{
  "type": "transcription_session.created",
  "event_id": "event_BDIjP2WyEWBCFq31A1QrP",
  "session": {
    "id": "sess_BDIjPc4zRbhQgBDcJSDXL",
    "object": "realtime.transcription_session",
    "expires_at": 1742511767,
    "input_audio_noise_reduction": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 200
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,  # <--- this became null
    "client_secret": null,
    "include": null
  }
}
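I then stream audio into the session as base64-encoded pcm16 via input_audio_buffer.append events, along these lines (a sketch; ws is the connection from the snippet above and the chunking is my own choice, not something the API requires):

import base64
import json

async def append_audio(ws, pcm16_chunk: bytes) -> None:
    # input_audio_buffer.append carries the raw audio as a base64 string
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    }))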
The transcription also doesn’t work. I do get speech_started and speech_stopped events:
"{\"type\":\"input_audio_buffer.speech_started\",\"event_id\":\"event_BDIjZmZPzKtGoZb3BG1g2\",\"audio_start_ms\":9780,\"item_id\":\"item_BDIjZcVYwyGgsUmbpHA83\"}"
but instead of the documented transcript events, I seem to receive conversation.item.created events:
{
  "type": "conversation.item.created",
  "event_id": "event_BDIja3273Hm7Ufs4GC5rL",
  "previous_item_id": null,
  "item": {
    "id": "item_BDIjZcVYwyGgsUmbpHA83",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
From these responses, I suspect the server is still treating the connection as a regular (non-transcription) Realtime session, but I'm not sure what I'm doing wrong.
Any ideas would be appreciated.