Discussion around syncing real-time AI-generated transcript deltas with WebRTC audio playback to ensure speech and on-screen text appear in natural alignment.

:waving_hand: Hey folks,
I’m using OpenAI’s Realtime API via WebRTC and listening to events like response.audio_transcript.delta to stream live transcriptions while the audio plays back. I’ve implemented word-by-word transcript updates, but I’m running into an issue:
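
Roughly, my current handling looks like this (a simplified sketch; the data-channel name and the delta field on the parsed event are approximations of my actual setup):

```typescript
// Simplified sketch of the current word-by-word rendering.
// Assumes server events arrive as JSON over a WebRTC data channel and
// that the new transcript text is carried in a "delta" field.
const pc = new RTCPeerConnection();
const events = pc.createDataChannel("oai-events"); // channel name may differ
const transcriptEl = document.getElementById("transcript")!;

events.onmessage = (msg: MessageEvent) => {
  const event = JSON.parse(msg.data);
  if (event.type === "response.audio_transcript.delta") {
    // Deltas arrive as fast as the model generates them,
    // well ahead of the audio the user actually hears.
    transcriptEl.textContent += event.delta;
  }
};
```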

The deltas arrive much faster than the actual audio plays back, so the full text appears on the screen long before it’s spoken. This makes interruption feel off — even if I stop the audio early, the on-screen text has already revealed most of the message.

Is there any way to sync transcript deltas with WebRTC audio playback more precisely? Do these deltas contain timestamps or identifiers I could use to throttle the UI to match the speech timing? Or are there any smart strategies people are using to fix this?

Appreciate any help or direction!


You’re absolutely right: having two independent, untimed streams creates real challenges. Audio timing can be reasonably inferred from chunk duration and the start time, but the transcript deltas carry no timestamps, which makes syncing much harder. It especially breaks down when you truncate playback and want to persist only what the user actually heard; without timing information, there’s no clear way to know which text to keep. For real-time captioning the limitation is even more pronounced. Some heuristics can approximate alignment, but it’s far from convenient.
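
For example, one rough heuristic is to buffer the deltas and reveal them at an assumed speaking rate once playback starts. A minimal sketch (the characters-per-second constant is an assumption you would have to tune, and it will drift on long responses):

```typescript
// Heuristic: reveal buffered transcript text at an assumed speaking rate,
// measured from the moment audio playback starts.
const CHARS_PER_SECOND = 15; // assumed average speaking rate; tune by ear

let fullTranscript = "";
let playbackStartMs: number | null = null;

function onTranscriptDelta(delta: string): void {
  // Buffer everything; reveal is handled separately in the render loop.
  fullTranscript += delta;
}

function onPlaybackStarted(): void {
  playbackStartMs = performance.now();
}

function renderVisibleTranscript(el: HTMLElement): void {
  if (playbackStartMs !== null) {
    const elapsedSec = (performance.now() - playbackStartMs) / 1000;
    const visibleChars = Math.floor(elapsedSec * CHARS_PER_SECOND);
    el.textContent = fullTranscript.slice(0, visibleChars);
  }
  requestAnimationFrame(() => renderVisibleTranscript(el));
}
```

On interruption, the currently visible slice is then a reasonable approximation of what the user actually heard, and is what you would persist.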

This issue is also present in the WebSocket API.
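
Over WebSocket you do at least receive the audio bytes yourself, so the amount of speech received so far can be estimated from chunk sizes. A sketch, assuming 16-bit mono PCM at 24 kHz (check the output format your session is actually configured for):

```typescript
// Estimate how much audio (in seconds) has been received so far from the
// size of each base64-encoded PCM chunk.
// Assumption: 16-bit mono PCM at 24 kHz; adjust for your session's format.
const SAMPLE_RATE = 24_000;
const BYTES_PER_SAMPLE = 2; // 16-bit PCM

let receivedAudioSec = 0;

function onAudioDelta(base64Audio: string): void {
  const byteLength = atob(base64Audio).length; // decoded byte count
  receivedAudioSec += byteLength / (SAMPLE_RATE * BYTES_PER_SAMPLE);
}
```

One option is then to reveal the transcript proportionally to elapsed playback time over the total received audio duration, rather than guessing a flat speaking rate.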