I am using Realtime Transcription (speech-to-text) over WebRTC, with voice activity detection (VAD) handling the start/stop of speech. Everything seems to work: I'm able to stream audio and receive transcriptions through the WebRTC data channel consistently.
The Problem is:
The experience is far from real-time: I am seeing delays of more than 10 seconds, which feels very wrong for a realtime service.
Need help:
Has anyone else experienced this? Is this level of delay expected with Realtime Transcription? Or am I missing something in setup or usage?
What I’m observing:
- The WebRTC connection succeeds and transcription_session.update is confirmed.
- I start speaking.
- I stop speaking.
- Over 10 seconds later, I finally receive:
input_audio_buffer.speech_started
input_audio_buffer.speech_stopped
input_audio_buffer.committed
- etc
- Only after that do I receive:
conversation.item.input_audio_transcription.delta (multiple of these)
conversation.item.input_audio_transcription.completed
That >10-second delay makes the transcription practically unusable for anything interactive. I never get delta updates during speech, and everything only triggers well after I've stopped talking, always 10 seconds later or more.
(p.s. There is no issue with my network connection/speed)
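To make the timing concrete, here is a minimal sketch (Unity WebRTC, C#) of the kind of handler I use to timestamp incoming server events on the data channel; the class name and Attach wiring are illustrative, not my exact code:

```csharp
using System;
using System.Text;
using Unity.WebRTC;
using UnityEngine;

public class RealtimeEventLogger : MonoBehaviour
{
    // Call once the data channel from the OpenAI peer connection is available.
    public void Attach(RTCDataChannel channel)
    {
        channel.OnMessage = bytes =>
        {
            // Every server event (speech_started, speech_stopped, transcription
            // deltas, ...) arrives here as UTF-8 JSON; logging a timestamp next
            // to each one makes the gap after I stop speaking easy to measure.
            string json = Encoding.UTF8.GetString(bytes);
            Debug.Log($"{DateTime.UtcNow:HH:mm:ss.fff} {json}");
        };
    }
}
```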
What I expected:
- speech_started and transcription deltas to start arriving while I'm still speaking.
- A much quicker reaction after speech ends (e.g., <1 s), with VAD telling me the speech stopped.
My code:
(1) I did receive this message as confirmation over the data channel:
```json
{
  "type": "transcription_session.updated",
  "event_id": "event_Bcb50b2dB9TknBYtWMyML",
  "session": {
    "id": "sess_Bcb4zNwTLXwtCWASSpIFG",
    "object": "realtime.transcription_session",
    "expires_at": 1748540497,
    "input_audio_noise_reduction": {
      "type": "far_field"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 200,
      "silence_duration_ms": 600
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "gpt-4o-mini-transcribe",
      "language": "en",
      "prompt": ""
    },
    "client_secret": null,
    "include": [
      "item.input_audio_transcription.logprobs"
    ]
  }
}
```
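For completeness, the update I send when the data channel opens is essentially the message below; this is a trimmed C# sketch whose settings simply mirror the confirmation above (the class and method names are illustrative):

```csharp
using Unity.WebRTC;

public static class TranscriptionSessionConfig
{
    // Settings mirror the transcription_session.updated confirmation above.
    private const string SessionUpdateJson = @"{
      ""type"": ""transcription_session.update"",
      ""session"": {
        ""input_audio_format"": ""pcm16"",
        ""input_audio_transcription"": {
          ""model"": ""gpt-4o-mini-transcribe"",
          ""language"": ""en"",
          ""prompt"": """"
        },
        ""turn_detection"": {
          ""type"": ""server_vad"",
          ""threshold"": 0.5,
          ""prefix_padding_ms"": 200,
          ""silence_duration_ms"": 600
        },
        ""input_audio_noise_reduction"": { ""type"": ""far_field"" },
        ""include"": [""item.input_audio_transcription.logprobs""]
      }
    }";

    // Send once the data channel reports that it is open.
    public static void SendUpdate(RTCDataChannel channel) => channel.Send(SessionUpdateJson);
}
```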
Is this just how the realtime service is designed to work? Or is something wrong in my setup?
Would love some guidance—this delay is a real blocker for my use case.
=========================================================
[Resolved] Update – May 30, 2025
I discovered the root cause of the transcription delay: Unity WebRTC's AudioStreamTrack does not support the pcm16, g711_ulaw, or g711_alaw encoding formats required by OpenAI's Realtime Transcription API. Unity's WebRTC audio track seems to re-encode the audio into another format, resulting in unsupported media and causing the transcription to be delayed by more than 10 seconds.
I am surprised to see that the transcription still works, though…
How I fixed it:
Instead of relying on the WebRTC AudioStreamTrack, I send my audio data manually over the WebRTC data channel using the input_audio_buffer.append message type, converting the audio to pcm16 before sending, as required (see the sketch after this list).
- This method works with or without VAD, including server-side VAD and semantic VAD.
- Once I made this change, the delay disappeared and real-time transcription started working as expected.
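For anyone hitting the same issue, here is a minimal sketch of that conversion-and-send path (Unity C#). It assumes the samples are already mono and resampled to 24 kHz, which is what the Realtime API expects for pcm16; the ManualAudioSender class and its chunking are illustrative rather than my exact code:

```csharp
using System;
using Unity.WebRTC;
using UnityEngine;

public class ManualAudioSender : MonoBehaviour
{
    private RTCDataChannel _channel;   // the already-open OpenAI data channel

    public void SetChannel(RTCDataChannel channel) => _channel = channel;

    // samples: mono float samples in [-1, 1] at 24 kHz, e.g. captured via
    // OnAudioFilterRead or AudioClip.GetData and resampled beforehand.
    public void SendChunk(float[] samples)
    {
        if (_channel == null || _channel.ReadyState != RTCDataChannelState.Open)
            return;

        // Convert float [-1, 1] samples to 16-bit little-endian PCM.
        var pcm16 = new byte[samples.Length * 2];
        for (int i = 0; i < samples.Length; i++)
        {
            short s = (short)(Mathf.Clamp(samples[i], -1f, 1f) * short.MaxValue);
            pcm16[i * 2] = (byte)(s & 0xff);
            pcm16[i * 2 + 1] = (byte)((s >> 8) & 0xff);
        }

        // Base64-encode and wrap in an input_audio_buffer.append event.
        string payload = "{\"type\":\"input_audio_buffer.append\",\"audio\":\""
                         + Convert.ToBase64String(pcm16) + "\"}";
        _channel.Send(payload);
    }
}
```

With server VAD enabled (as in the session config above), the server commits the buffer on its own when it detects the end of speech, so there is no need to send input_audio_buffer.commit explicitly.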
May I ask for:
- OpenAI team, please provide better error messages. It is very hard to debug when no error or warning is sent in a situation like this.