Realtime streaming transcription

After reading the docs, my understanding is that I can do real-time speech-to-text with the Realtime API. However, I’m finding that I don’t get any transcription deltas back until the user stops talking, which defeats the purpose of what I’m trying to do.

I’m using WebSockets with this URL: wss://api.openai.com/v1/realtime?intent=transcription

The session configuration is:

{
  "type": "session.update",
  "session": {
    "type": "transcription",
    "audio": {
      "input": {
        "format": {
          "type": "audio/pcm",
          "rate": 24000
        },
        "noise_reduction": {
          "type": "near_field"
        },
        "transcription": {
          "model": "gpt-4o-mini-transcribe"
        },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.5,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500
        }
      }
    }
  }
}
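For reference, here is roughly how I send that configuration, expressed as a small Python sketch. The payload matches the JSON above; the connection code in the comment uses the third-party `websockets` library and an `OPENAI_API_KEY` environment variable, both of which are my setup rather than anything mandated by the API.

```python
import json

def build_session_update(model="gpt-4o-mini-transcribe", rate=24000):
    """Build the session.update payload shown above."""
    return {
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {
                "input": {
                    "format": {"type": "audio/pcm", "rate": rate},
                    "noise_reduction": {"type": "near_field"},
                    "transcription": {"model": model},
                    "turn_detection": {
                        "type": "server_vad",
                        "threshold": 0.5,
                        "prefix_padding_ms": 300,
                        "silence_duration_ms": 500,
                    },
                }
            },
        },
    }

# The connection itself looks something like this (not run here;
# header argument name depends on your websockets version):
#
# import os, asyncio, websockets
#
# async def main():
#     url = "wss://api.openai.com/v1/realtime?intent=transcription"
#     headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
#     async with websockets.connect(url, additional_headers=headers) as ws:
#         await ws.send(json.dumps(build_session_update()))
#         async for message in ws:
#             print(json.loads(message).get("type"))
```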

The flow goes like this:

  • Connect WebSocket
  • Send session config
  • Receive session.created
  • Receive session.updated
  • Start talking
  • Receive input_audio_buffer.speech_started
  • Talk some more
  • Stop talking
  • Receive input_audio_buffer.speech_stopped
  • Receive a bunch of conversation.item.input_audio_transcription.delta with individual words
  • Receive conversation.item.input_audio_transcription.completed with the full text

I would like to get the deltas while I am talking, not at the end.

I understand that this does not work on Azure, but I’m using the OpenAI endpoint.

The documentation I am using is here:

I found this which helped me understand what is going on: Realtime transcription messages flow is wrong - #16 by 6r0m

It seems that no transcription is returned until the accumulated audio is deemed ready. That happens either when the client sends an input_audio_buffer.commit message or when the server decides it’s time based on the server_vad or semantic_vad config. I tried sending commit messages more often, but the accuracy was awful, perhaps because I was committing mid-word. I also tried server_vad with a small silence_duration_ms, and semantic_vad with eagerness set to high. Slightly better, but not the experience I am going for. I have seen UIs where the text builds up as the user speaks, and that’s what I want. I don’t see how that’s possible with the Realtime API, but I would be very happy to be proven wrong!
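For completeness, this is roughly what the manual-commit experiment looked like: keep appending base64-encoded PCM chunks and send a commit every so often. The message shapes are the Realtime API’s input_audio_buffer events; the timing of the commits was my own (bad) guesswork.

```python
import base64
import json

def append_audio_message(pcm_bytes):
    """Wrap raw PCM16 audio as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def commit_message():
    """Ask the server to transcribe whatever is buffered so far."""
    return json.dumps({"type": "input_audio_buffer.commit"})
```

Committing on a fixed timer is what cut words in half for me; the commit boundary has no idea where a word ends.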

One last update. I have reverted to using the Whisper API. It is a lot less expensive, and by implementing my own VAD I was able to do partial transcriptions and update the text in my UI while the user is talking. It works OK and is a much better user experience than buffering all the audio and doing a one-time transcription at the end.
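The hand-rolled VAD is nothing fancy: an energy-threshold detector over PCM16 frames, along these lines. The threshold and frame size are arbitrary values that happened to work for my mic, not recommendations.

```python
import struct

def frame_rms(pcm16: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 frame."""
    n = len(pcm16) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    return (sum(s * s for s in samples) / n) ** 0.5

def is_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Crude VAD: a frame counts as speech if its energy exceeds a threshold."""
    return frame_rms(pcm16) > threshold
```

Once silence persists for a few hundred milliseconds, I cut the buffer there and send that chunk off for transcription, so the boundaries land between words instead of in the middle of them.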

One concern I had is that POSTing to https://api.openai.com/v1/audio/transcriptions has more per-request overhead than a persistent WebSocket, but in practice it seems OK if you can live with some delay.
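The POST itself is a plain multipart upload. A sketch, with the request split into a testable builder; the filename, content type, and `whisper-1` model name are my choices, and the actual network call (shown in the comment, using the third-party `requests` library) is not run here.

```python
def build_transcription_request(wav_bytes, api_key, model="whisper-1"):
    """Assemble keyword arguments for a POST to /v1/audio/transcriptions."""
    return {
        "url": "https://api.openai.com/v1/audio/transcriptions",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "files": {"file": ("chunk.wav", wav_bytes, "audio/wav")},
        "data": {"model": model},
        "timeout": 30,
    }

# Actual call (not run here):
#
# import requests
# resp = requests.post(**build_transcription_request(wav_bytes, api_key))
# resp.raise_for_status()
# text = resp.json()["text"]
```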