Discussion around syncing real-time AI-generated transcript deltas with WebRTC audio playback to ensure speech and on-screen text appear in natural alignment.

:waving_hand: Hey folks,
I’m using OpenAI’s Realtime API via WebRTC and listening to events like response.audio_transcript.delta to stream live transcriptions while the audio plays back. I’ve implemented word-by-word transcript updates, but I’m running into an issue:
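
Roughly, my current handling looks like this (a simplified sketch; the data-channel name and the delta field on the parsed event are approximations of my actual setup):

```typescript
// Simplified sketch of the current word-by-word rendering.
// Assumes server events arrive as JSON over a WebRTC data channel and
// that the new transcript text is carried in a "delta" field.
const pc = new RTCPeerConnection();
const events = pc.createDataChannel("oai-events"); // channel name may differ
const transcriptEl = document.getElementById("transcript")!;

events.onmessage = (msg: MessageEvent) => {
  const event = JSON.parse(msg.data);
  if (event.type === "response.audio_transcript.delta") {
    // Deltas arrive as fast as the model generates them,
    // well ahead of the audio the user actually hears.
    transcriptEl.textContent += event.delta;
  }
};
```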

The deltas arrive much faster than the actual audio plays back, so the full text appears on the screen long before it’s spoken. This makes interruption feel off — even if I stop the audio early, the on-screen text has already revealed most of the message.

Is there any way to sync transcript deltas with WebRTC audio playback more precisely? Do these deltas contain timestamps or identifiers I could use to throttle the UI to match the speech timing? Or are there any smart strategies people are using to fix this?

Appreciate any help or direction!


You’re absolutely right: having two independent, untimed streams creates real challenges. Audio timing can be reasonably inferred from chunk duration and the start time, but the transcript deltas carry no timestamps, which makes syncing much harder. It especially breaks down when you truncate playback and want to persist only what the user actually heard; without timing information, there’s no clear way to know which text to keep. For real-time captioning the limitation is even more pronounced. Some heuristics can approximate alignment, but it’s far from convenient.
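
For example, one rough heuristic is to buffer the deltas and reveal them at an assumed speaking rate once playback starts. A minimal sketch (the characters-per-second constant is an assumption you would have to tune, and it will drift on long responses):

```typescript
// Heuristic: reveal buffered transcript text at an assumed speaking rate,
// measured from the moment audio playback starts.
const CHARS_PER_SECOND = 15; // assumed average speaking rate; tune by ear

let fullTranscript = "";
let playbackStartMs: number | null = null;

function onTranscriptDelta(delta: string): void {
  // Buffer everything; reveal is handled separately in the render loop.
  fullTranscript += delta;
}

function onPlaybackStarted(): void {
  playbackStartMs = performance.now();
}

function renderVisibleTranscript(el: HTMLElement): void {
  if (playbackStartMs !== null) {
    const elapsedSec = (performance.now() - playbackStartMs) / 1000;
    const visibleChars = Math.floor(elapsedSec * CHARS_PER_SECOND);
    el.textContent = fullTranscript.slice(0, visibleChars);
  }
  requestAnimationFrame(() => renderVisibleTranscript(el));
}
```

On interruption, the currently visible slice is then a reasonable approximation of what the user actually heard, and is what you would persist.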

This issue is also present in the WebSocket API.
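
Over WebSocket you do at least receive the audio bytes yourself, so the amount of speech received so far can be estimated from chunk sizes. A sketch, assuming 16-bit mono PCM at 24 kHz (check the output format your session is actually configured for):

```typescript
// Estimate how much audio (in seconds) has been received so far from the
// size of each base64-encoded PCM chunk.
// Assumption: 16-bit mono PCM at 24 kHz; adjust for your session's format.
const SAMPLE_RATE = 24_000;
const BYTES_PER_SAMPLE = 2; // 16-bit PCM

let receivedAudioSec = 0;

function onAudioDelta(base64Audio: string): void {
  const byteLength = atob(base64Audio).length; // decoded byte count
  receivedAudioSec += byteLength / (SAMPLE_RATE * BYTES_PER_SAMPLE);
}
```

One option is then to reveal the transcript proportionally to elapsed playback time over the total received audio duration, rather than guessing a flat speaking rate.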