I am using Realtime Transcription (speech-to-text) over WebRTC, with voice activity detection (VAD) handling the start/stop of speech. Everything seems to work: I'm able to stream audio and receive transcriptions through the WebRTC data channel consistently.
The Problem is:
The experience is far from real-time: I am seeing delays of more than 10 seconds, which feels very wrong for a realtime service.
Need help:
Has anyone else experienced this? Is this level of delay expected with Realtime Transcription? Or am I missing something in setup or usage?
What I’m observing:
- The WebRTC connection succeeds and transcription_session.update is confirmed.
- I start speaking.
- I stop speaking.
- Over 10 seconds later, I finally receive:
input_audio_buffer.speech_started
input_audio_buffer.speech_stopped
input_audio_buffer.committed
- etc
- Only after that do I receive:
conversation.item.input_audio_transcription.delta (multiple of these)
conversation.item.input_audio_transcription.completed
That >10-second delay makes the transcription practically unusable for anything interactive. I never get delta updates during speech, and everything only triggers well after I've stopped talking, always 10 seconds later or more.
(p.s. There is no issue with my network connection/speed)
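To make the timing concrete, here is a minimal sketch (Unity WebRTC, C#) of the kind of handler I use to timestamp incoming server events on the data channel; the class name and Attach wiring are illustrative, not my exact code:

```csharp
using System;
using System.Text;
using Unity.WebRTC;
using UnityEngine;

public class RealtimeEventLogger : MonoBehaviour
{
    // Call once the data channel from the OpenAI peer connection is available.
    public void Attach(RTCDataChannel channel)
    {
        channel.OnMessage = bytes =>
        {
            // Every server event (speech_started, speech_stopped, transcription
            // deltas, ...) arrives here as UTF-8 JSON; logging a timestamp next
            // to each one makes the gap after I stop speaking easy to measure.
            string json = Encoding.UTF8.GetString(bytes);
            Debug.Log($"{DateTime.UtcNow:HH:mm:ss.fff} {json}");
        };
    }
}
```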
What I expected:
- speech_started and transcription deltas to start arriving while I'm still speaking.
- A much quicker reaction after speech ends (e.g., <1 s), with VAD telling me the speech stopped.
My code:
(1) I did receive this message as confirmation over the data channel:
```json
{
  "type": "transcription_session.updated",
  "event_id": "event_Bcb50b2dB9TknBYtWMyML",
  "session": {
    "id": "sess_Bcb4zNwTLXwtCWASSpIFG",
    "object": "realtime.transcription_session",
    "expires_at": 1748540497,
    "input_audio_noise_reduction": {
      "type": "far_field"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 200,
      "silence_duration_ms": 600
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "gpt-4o-mini-transcribe",
      "language": "en",
      "prompt": ""
    },
    "client_secret": null,
    "include": [
      "item.input_audio_transcription.logprobs"
    ]
  }
}
```
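For completeness, the update I send when the data channel opens is essentially the message below; this is a trimmed C# sketch whose settings simply mirror the confirmation above (the class and method names are illustrative):

```csharp
using Unity.WebRTC;

public static class TranscriptionSessionConfig
{
    // Settings mirror the transcription_session.updated confirmation above.
    private const string SessionUpdateJson = @"{
      ""type"": ""transcription_session.update"",
      ""session"": {
        ""input_audio_format"": ""pcm16"",
        ""input_audio_transcription"": {
          ""model"": ""gpt-4o-mini-transcribe"",
          ""language"": ""en"",
          ""prompt"": """"
        },
        ""turn_detection"": {
          ""type"": ""server_vad"",
          ""threshold"": 0.5,
          ""prefix_padding_ms"": 200,
          ""silence_duration_ms"": 600
        },
        ""input_audio_noise_reduction"": { ""type"": ""far_field"" },
        ""include"": [""item.input_audio_transcription.logprobs""]
      }
    }";

    // Send once the data channel reports that it is open.
    public static void SendUpdate(RTCDataChannel channel) => channel.Send(SessionUpdateJson);
}
```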
Is this just how the realtime service is designed to work? Or is something wrong in my setup?
Would love some guidance—this delay is a real blocker for my use case.
=========================================================
[Resolved] Update – May 30, 2025
I discovered the root cause of the transcription delay: Unity WebRTC's AudioStreamTrack does not support the pcm16, g711_ulaw, or g711_alaw encoding formats required by OpenAI's Realtime Transcription API. Unity's WebRTC audio track seems to re-encode the audio into another format, resulting in unsupported media and causing the transcription to be delayed by more than 10 seconds.
I am surprised to see that the transcription still works, though…
How I fixed it:
Instead of relying on the WebRTC AudioStreamTrack, I send my audio data manually over the WebRTC data channel using the input_audio_buffer.append message type, converting the audio to pcm16 before sending, as required (see the sketch after this list).
- This method works with or without VAD, including server-side VAD and semantic VAD.
- Once I made this change, the delay disappeared and real-time transcription started working as expected.
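For anyone hitting the same issue, here is a minimal sketch of that conversion-and-send path (Unity C#). It assumes the samples are already mono and resampled to 24 kHz, which is what the Realtime API expects for pcm16; the ManualAudioSender class and its chunking are illustrative rather than my exact code:

```csharp
using System;
using Unity.WebRTC;
using UnityEngine;

public class ManualAudioSender : MonoBehaviour
{
    private RTCDataChannel _channel;   // the already-open OpenAI data channel

    public void SetChannel(RTCDataChannel channel) => _channel = channel;

    // samples: mono float samples in [-1, 1] at 24 kHz, e.g. captured via
    // OnAudioFilterRead or AudioClip.GetData and resampled beforehand.
    public void SendChunk(float[] samples)
    {
        if (_channel == null || _channel.ReadyState != RTCDataChannelState.Open)
            return;

        // Convert float [-1, 1] samples to 16-bit little-endian PCM.
        var pcm16 = new byte[samples.Length * 2];
        for (int i = 0; i < samples.Length; i++)
        {
            short s = (short)(Mathf.Clamp(samples[i], -1f, 1f) * short.MaxValue);
            pcm16[i * 2] = (byte)(s & 0xff);
            pcm16[i * 2 + 1] = (byte)((s >> 8) & 0xff);
        }

        // Base64-encode and wrap in an input_audio_buffer.append event.
        string payload = "{\"type\":\"input_audio_buffer.append\",\"audio\":\""
                         + Convert.ToBase64String(pcm16) + "\"}";
        _channel.Send(payload);
    }
}
```

With server VAD enabled (as in the session config above), the server commits the buffer on its own when it detects the end of speech, so there is no need to send input_audio_buffer.commit explicitly.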
May I ask for:
- OpenAI team, please provide better error messages. It is very hard to debug when no error or warning is sent in a situation like this.