Realtime streaming transcription

After reading the docs, my understanding is that the Realtime API can do real-time speech-to-text. However, I don’t get any transcription deltas back until the user stops talking, which defeats the purpose.

I’m using WebSockets with this URL: wss://api.openai.com/v1/realtime?intent=transcription

The session configuration is:

{
  "type": "session.update",
  "session": {
    "type": "transcription",
    "audio": {
      "input": {
        "format": {
          "type": "audio/pcm",
          "rate": 24000
        },
        "noise_reduction": {
          "type": "near_field"
        },
        "transcription": {
          "model": "gpt-4o-mini-transcribe"
        },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.5,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500
        }
      }
    }
  }
}

The flow goes like this:

  • Connect WebSocket
  • Send session config
  • Receive session.created
  • Receive session.updated
  • Start talking
  • Receive input_audio_buffer.speech_started
  • Talk some more
  • Stop talking
  • Receive input_audio_buffer.speech_stopped
  • Receive a bunch of conversation.item.input_audio_transcription.delta with individual words
  • Receive conversation.item.input_audio_transcription.completed with the full text
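
For anyone following along, the last two steps of that flow can be handled with a small accumulator that builds the text as delta events arrive. This is only a rough sketch: it assumes each event is a JSON object with an `item_id` plus a `delta` (for delta events) or `transcript` (for completed events) field, and the class name `TranscriptAccumulator` is made up for illustration. There is no WebSocket code here, just the message handling.

```python
import json
from collections import defaultdict


class TranscriptAccumulator:
    """Collects transcription text per item (hypothetical helper, not part of any SDK)."""

    def __init__(self):
        self.partial = defaultdict(str)  # item_id -> text built from deltas so far
        self.final = {}                  # item_id -> transcript from the completed event

    def handle(self, raw_message: str) -> None:
        """Process one raw JSON message received from the socket."""
        event = json.loads(raw_message)
        etype = event.get("type", "")
        item_id = event.get("item_id", "")  # assumed field name
        if etype == "conversation.item.input_audio_transcription.delta":
            self.partial[item_id] += event.get("delta", "")
        elif etype == "conversation.item.input_audio_transcription.completed":
            self.final[item_id] = event.get("transcript", "")
            self.partial.pop(item_id, None)
        # Other event types (speech_started, speech_stopped, ...) are ignored here.

    def text_for(self, item_id: str) -> str:
        """Best text available for an item: final if completed, else the partial."""
        return self.final.get(item_id, self.partial.get(item_id, ""))
```

Calling `handle()` for every incoming message and redrawing the UI from `text_for()` shows the words as soon as the server sends them, but it cannot make the deltas arrive any earlier than the server emits them.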

I would like to get the deltas while I am talking, not at the end.

I understand that this does not work on Azure, but I’m using the OpenAI endpoint.

The documentation I am using is here:

I found this which helped me understand what is going on: Realtime transcription messages flow is wrong - #16 by 6r0m

It seems that no transcription is returned until the accumulated audio is deemed ready. This happens if the client sends an input_audio_buffer.commit message or the server decides it’s time based on the server_vad or semantic_vad config. I tried sending commit messages more often and the accuracy was awful… maybe because I was committing partial words? I also tried server_vad with a small silence_duration_ms and semantic_vad with eagerness = high. Slightly better, but not the experience I am going for. I have seen UIs where the text is built as the user speaks and that’s what I want. I don’t see how that’s possible with the realtime API, but I would be very happy to be proven wrong!
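
In case it helps anyone reproduce this, here is roughly what the client messages for those experiments look like. Treat it as a sketch under assumptions: the helper names are hypothetical, the `turn_detection` nesting simply mirrors the session config earlier in this thread, and I am assuming a `session.update` that carries only the turn detection block is accepted.

```python
import base64
import json


def append_audio_message(pcm_bytes: bytes) -> str:
    """Wrap a chunk of raw PCM audio in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


def commit_message() -> str:
    """Ask the server to treat the audio buffered so far as one finished segment."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def turn_detection_update(kind: str, **options) -> str:
    """Build a session.update that only changes turn detection.

    Uses the same nesting as the session config posted above (assumption:
    a partial update like this is accepted by the server).
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {"input": {"turn_detection": {"type": kind, **options}}},
        },
    })


# The two variants described in the post above:
eager_semantic = turn_detection_update("semantic_vad", eagerness="high")
short_silence = turn_detection_update("server_vad", silence_duration_ms=200)
```

The 200 ms value is just an example of "small". Committing on a fixed timer instead of at a pause is what risks cutting words in half, which would fit the poor accuracy described above.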

One last update. I have reverted to using the Whisper API. It is a lot less expensive, and by implementing my own VAD code I was able to do partial transcriptions and update the text in my UI as the user is talking. It works OK and is a much better user experience than buffering all the audio and doing a one-time transcription at the end.
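
The VAD code itself is not posted here, so the following is not that implementation. It is a minimal sketch of one common approach, an energy threshold over short frames of 16-bit mono PCM, to show where segment boundaries could come from. The function names, the threshold of 500 and the 300 ms minimum pause are all illustrative assumptions that would need tuning for a real microphone.

```python
import array
import math


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit PCM (native byte order)."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_pauses(pcm: bytes, sample_rate: int = 24000, frame_ms: int = 20,
                threshold: float = 500.0, min_silence_ms: int = 300) -> list:
    """Return byte offsets where a pause of at least min_silence_ms begins.

    A pause only counts if speech (a frame at or above threshold) was heard
    since the previous pause, so leading silence is ignored.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    needed = max(1, min_silence_ms // frame_ms)       # quiet frames in a row
    pauses, quiet_run, heard_speech = [], 0, False
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        if frame_rms(pcm[start:start + frame_bytes]) >= threshold:
            heard_speech, quiet_run = True, 0
        else:
            quiet_run += 1
            if heard_speech and quiet_run == needed:
                # Report the offset of the first quiet frame in this run.
                pauses.append(start - (needed - 1) * frame_bytes)
                heard_speech = False
    return pauses
```

Audio up to each pause can be sent off as a finished segment, and the audio since the last pause can be re-sent every so often to refresh a partial result while the user is still talking.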

One concern I had is that I’m using POST to https://api.openai.com/v1/audio/transcriptions which is slower than WebSockets, but in practice it seems OK if you can live with some delay.
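
For reference, a sketch of that POST using only the Python standard library. Assumptions to flag: the segment is raw 16-bit mono PCM that gets wrapped in a WAV container first, the model name `whisper-1` and the file name `segment.wav` are placeholders, and the helper names are hypothetical. The multipart body is assembled by hand to avoid third-party dependencies.

```python
import io
import json
import urllib.request
import uuid
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container, in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def build_request(wav_bytes: bytes, api_key: str,
                  model: str = "whisper-1") -> urllib.request.Request:
    """Build (but do not send) a multipart POST with 'model' and 'file' fields."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="segment.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode() + wav_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        "https://api.openai.com/v1/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(wav_bytes: bytes, api_key: str) -> str:
    """Send one segment and return the 'text' field of the JSON response."""
    request = build_request(wav_bytes, api_key)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read())["text"]
```

Each call is a full HTTP round trip, so keeping segments short is what keeps the delay tolerable.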

I’m curious why you need real-time transcription. If you are transcribing meetings, you can actually get transcripts from certain platforms directly from the platform itself. This approach does require integrating separately with each platform though.

If you prefer an API that works across platforms you can use a real-time transcription API that allows you to try out different models so that you can pick the model that captures the most context. Some of the real-time transcription APIs will also give you speaker names so that you know who was talking throughout the conversation.

I don’t need real-time transcription, but I prefer that experience to recording everything the user says and then transcribing. I think the user seeing the words come up as they speak is a good form of feedback.