After reading the docs, my understanding is that the Realtime API supports real-time speech-to-text. However, I'm finding that I don't get any transcription deltas back until the user stops talking, which defeats the purpose of what I'm trying to do.
I’m using WebSockets with this URL: wss://api.openai.com/v1/realtime?intent=transcription
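For context, this is roughly how I open the connection (Python sketch using the `websockets` package; `build_headers` is just my own helper, and Bearer auth is the only header I'm sending):

```python
import os

# Transcription intent endpoint, as in the post above.
REALTIME_URL = "wss://api.openai.com/v1/realtime?intent=transcription"

def build_headers(api_key: str) -> dict:
    # Standard Bearer auth; I don't believe any other headers are required.
    return {"Authorization": f"Bearer {api_key}"}

async def connect(api_key: str):
    # Requires: pip install websockets
    # (newer versions of the library take `additional_headers`;
    # older ones called this `extra_headers`)
    import websockets
    return await websockets.connect(
        REALTIME_URL, additional_headers=build_headers(api_key)
    )
```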
The session configuration is:
```json
{
  "type": "session.update",
  "session": {
    "type": "transcription",
    "audio": {
      "input": {
        "format": {
          "type": "audio/pcm",
          "rate": 24000
        },
        "noise_reduction": {
          "type": "near_field"
        },
        "transcription": {
          "model": "gpt-4o-mini-transcribe"
        },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.5,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500
        }
      }
    }
  }
}
```
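After `session.updated` I stream microphone audio continuously as `input_audio_buffer.append` events, with each PCM16 chunk base64-encoded. A minimal sketch of how I build those events (the helper name is my own):

```python
import base64
import json

def audio_append_event(pcm16_chunk: bytes) -> str:
    # input_audio_buffer.append carries the audio chunk as base64-encoded
    # PCM16 at the rate declared in the session config (24 kHz here).
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })
```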
The flow goes like this:
- Connect WebSocket
- Send session config
- Receive session.created
- Receive session.updated
- Start talking
- Receive input_audio_buffer.speech_started
- Talk some more
- Stop talking
- Receive input_audio_buffer.speech_stopped
- Receive a bunch of conversation.item.input_audio_transcription.delta with individual words
- Receive conversation.item.input_audio_transcription.completed with the full text
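For completeness, my receive loop dispatches on the event type roughly like this (Python sketch; `handle_event` is my own helper, and the `delta`/`transcript` field names are what I see in the payloads):

```python
import json
from typing import Optional

def handle_event(raw: str, parts: list) -> Optional[str]:
    """Accumulate transcription deltas; return the full text when a
    turn completes, else None."""
    event = json.loads(raw)
    etype = event["type"]
    if etype == "conversation.item.input_audio_transcription.delta":
        parts.append(event["delta"])  # partial text for the current turn
    elif etype == "conversation.item.input_audio_transcription.completed":
        return event["transcript"]  # full text for the turn
    return None
```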
I would like to get the deltas while I am talking, not at the end.
I understand that this does not work on Azure, but I’m using the OpenAI endpoint.
The documentation I am using is here: