Syncing audio and text in the Realtime API

I am using OpenAI’s Realtime API (gpt-4o-realtime-preview-2024-12-17) in a React-based application for live transcription and response generation. However, the transcribed text and the generated speech output do not align properly: sometimes the text appears earlier than expected, or the audio plays with a delay.

Implementation Details:

  • The application uses WebSockets to stream real-time audio to OpenAI.

  • I am using the RealtimeClient from OpenAI’s API to send and receive live audio responses.

  • The WavRecorder and WavStreamPlayer helpers handle audio capture and playback, since the audio is streamed as 16-bit PCM (a minimal setup sketch follows this list).

  • The text responses are updated dynamically as they arrive via the API.
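
For reference, the client, recorder, and player live in refs inside the component and are created roughly like this (a minimal sketch on my side; the 24 kHz sample rate matches the decode call further down, while the import paths and env-variable name are assumptions):

import { useRef } from "react";
import { RealtimeClient } from "@openai/realtime-api-beta";
import { WavRecorder, WavStreamPlayer } from "./lib/wavtools"; // path depends on your setup

// The Realtime API streams 16-bit PCM mono at 24 kHz, so both helpers use that rate.
const clientRef = useRef(
  new RealtimeClient({
    apiKey: process.env.REACT_APP_OPENAI_API_KEY, // assumed env var name
    dangerouslyAllowAPIKeyInBrowser: true,
  })
);
const wavRecorderRef = useRef(new WavRecorder({ sampleRate: 24000 }));
const wavStreamPlayerRef = useRef(new WavStreamPlayer({ sampleRate: 24000 }));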

This is the code for connecting to the API:

const connectConversation = useCallback(async () => {
  const client = clientRef.current;
  const wavRecorder = wavRecorderRef.current;
  const wavStreamPlayer = wavStreamPlayerRef.current;

  // Start microphone capture and the playback audio context
  await wavRecorder.begin();
  await wavStreamPlayer.connect();

  try {
    const response = await client.connect();
    if (response) {
      setLoading(false);
      client.sendUserMessageContent([{ type: "input_text", text: "Hello!" }]);

      // With server-side VAD, stream mic audio to the API continuously
      if (client.getTurnDetectionType() === "server_vad") {
        await wavRecorder.record((data) => client.appendInputAudio(data.mono));
      }
    }
  } catch (error) {
    console.error("Error connecting:", error);
  }
}, []);

This is the code for handling the streamed response:

client.on("conversation.updated", async ({ item, delta }) => {
  if (item.role === "assistant" && delta?.audio) {
    // Queue the new audio chunk for playback...
    wavStreamPlayer.add16BitPCM(delta.audio, item.id);
    // ...but the transcript ref updates immediately, ahead of playback
    textRef.current = item.formatted.transcript;
  } else if (delta?.text) {
    textRef.current = item.formatted.transcript;
  }

  if (item.status === "completed" && item.formatted.audio?.length) {
    // Decode the accumulated 16-bit PCM (24 kHz) into a playable WAV URL
    const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
    setAudiosrc(wavFile.url);
  }
});
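
Note what happens here: the transcript ref is updated the moment a delta arrives, while the PCM chunk is only queued inside WavStreamPlayer, so the text naturally runs ahead of playback. If your wavtools build is the one from the OpenAI realtime console reference app, WavStreamPlayer also exposes getTrackSampleOffset(), which reports how many samples of the current track have actually been played. A rough sketch of reading the real playback position (the helper name and the 100 ms polling interval are mine):

// Ask the player how far the currently playing track has progressed.
// Assumes wavtools' WavStreamPlayer.getTrackSampleOffset(), as in the realtime console.
const pollPlaybackPosition = async (wavStreamPlayer) => {
  const result = await wavStreamPlayer.getTrackSampleOffset();
  if (!result) return null; // nothing queued/playing yet
  const { trackId, offset } = result;
  return { trackId, playedSeconds: offset / 24000 }; // samples played / sample rate
};

// Example: log every 100 ms which item is playing and how far along it is.
setInterval(async () => {
  const pos = await pollPlaybackPosition(wavStreamPlayerRef.current);
  if (pos) console.log(`item ${pos.trackId}: ${pos.playedSeconds.toFixed(2)}s played`);
}, 100);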

Problem observed

The text cannot be scrolled in sync with the audio playback.

Scrolling logic, based on an estimated duration of 150 words per minute:

const scrollText = () => {
  if (!scrollContainerRef.current) return;

  const container = scrollContainerRef.current;
  const currentTime = Date.now();
  const elapsed = currentTime - scrollStartTimeRef.current;
  const duration = getScrollDuration(text);

  if (elapsed >= duration) {
    container.scrollTop = container.scrollHeight - container.clientHeight;
    return;
  }

  const progress = elapsed / duration;
  const targetScrollTop = container.scrollHeight - container.clientHeight;

  // Smooth easing function for better scrolling
  const easeInOutQuad = (t) =>
    t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;

  container.scrollTop = targetScrollTop * easeInOutQuad(progress);
  animationFrameRef.current = requestAnimationFrame(scrollText);
};

Approach taken

  1. Converting 16-bit PCM into an audio source

     const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
     setAudiosrc(wavFile.url);

     • However, the conversion takes time depending on the length of the
       response, causing desynchronization.

  2. Scrolling based on word count (150 WPM rule)

     const wordsPerMinute = 150;
     const words = text.split(" ").length;
     return (words / wordsPerMinute) * 60 * 1000;

     This works for short responses but fails for longer ones because actual
     speech speed varies (see the sketch after this list).
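
Since the completed item already carries the raw PCM at 24 kHz, the true playback duration can also be derived from the sample count rather than a words-per-minute guess. A rough sketch of that calculation (assuming item.formatted.audio is an Int16Array of mono samples, as the decode call above implies):

// Duration of a 16-bit PCM mono buffer in milliseconds.
// Assumes `pcm` is an Int16Array of mono samples at 24 kHz.
const getAudioDurationMs = (pcm, sampleRate = 24000) => (pcm.length / sampleRate) * 1000;

// e.g. use the real audio length instead of the 150 WPM estimate:
const duration = getAudioDurationMs(item.formatted.audio);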

Questions:

  1. How can I accurately sync the text scroll with the real-time audio
     playback?
  2. Are there any existing libraries or best practices for handling
     text-audio synchronization in real-time applications?

Any insights or suggestions would be greatly appreciated!
