Hey folks,
I’m using OpenAI’s Realtime API over WebRTC and listening for events like response.audio_transcript.delta to stream live transcriptions while the audio plays back. I’ve implemented word-by-word transcript updates, but I’m running into an issue:
The deltas arrive much faster than the actual audio plays back, so the full text appears on screen long before it’s spoken. This makes interruptions feel off: even if I stop the audio early, the on-screen text has already revealed most of the message.
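For context, my handler just appends each delta the moment it arrives. A stripped-down version looks roughly like this (the element ID and variable names are from my own app, and I’ve left out the SDP offer/answer handshake):

```ts
// Simplified version of what I have now; session setup is omitted.
const pc = new RTCPeerConnection();
const dc = pc.createDataChannel("oai-events");
const transcriptEl = document.getElementById("transcript")!;

dc.addEventListener("message", (event: MessageEvent) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "response.audio_transcript.delta") {
    // Append each delta as soon as it arrives -- this is what races
    // ahead of the audio that's still being spoken.
    transcriptEl.textContent += msg.delta;
  }
});
```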
Is there any way to sync transcript deltas with WebRTC audio playback more precisely? Do these deltas contain timestamps or identifiers I could use to throttle the UI to match the speech timing? Or are there any smart strategies people are using to fix this?
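To make “throttle” concrete, the kind of fixed-rate pacing I’ve been experimenting with looks roughly like the sketch below. The chars-per-second constant is just a guess on my part (not something I got from the API), which is exactly why it drifts from the actual speech:

```ts
// Naive throttling attempt: buffer deltas and drain them at a fixed rate
// instead of rendering immediately. CHARS_PER_SECOND is a guess, not
// derived from the API, so it drifts from the real audio.
const transcriptEl = document.getElementById("transcript")!;
const pending: string[] = [];
const CHARS_PER_SECOND = 15;

// Called from the delta handler instead of writing to the DOM directly.
function enqueueDelta(delta: string) {
  pending.push(...delta.split(""));
}

let carry = 0;
setInterval(() => {
  if (pending.length === 0) {
    carry = 0; // nothing buffered, so don't build up a burst
    return;
  }
  carry += CHARS_PER_SECOND * 0.1; // characters "owed" for this 100 ms tick
  const n = Math.floor(carry);
  if (n > 0) {
    transcriptEl.textContent += pending.splice(0, n).join("");
    carry -= n;
  }
}, 100);
```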
Appreciate any help or direction!