Syncing audio and text in the Realtime API

I am using OpenAI’s Realtime API (gpt-4o-realtime-preview-2024-12-17) in a React-based application for live transcription and response generation. However, the transcribed text and the generated speech output do not align properly: sometimes the text appears earlier than expected, or the audio plays with a delay.

Implementation Details:

  • The application uses WebSockets to stream real-time audio to OpenAI.

  • I am using the RealtimeClient from OpenAI’s API to send and receive live audio responses.

  • The WavRecorder and WavStreamPlayer helpers handle audio capture and playback, since the audio is streamed as 16-bit PCM (a minimal setup sketch follows this list).

  • The text responses are updated dynamically as they arrive via the API.
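
For reference, the client, recorder, and player live in refs inside the component and are created roughly like this (a minimal sketch on my side; the 24 kHz sample rate matches the decode call further down, while the import paths and env-variable name are assumptions):

import { useRef } from "react";
import { RealtimeClient } from "@openai/realtime-api-beta";
import { WavRecorder, WavStreamPlayer } from "./lib/wavtools"; // path depends on your setup

// The Realtime API streams 16-bit PCM mono at 24 kHz, so both helpers use that rate.
const clientRef = useRef(
  new RealtimeClient({
    apiKey: process.env.REACT_APP_OPENAI_API_KEY, // assumed env var name
    dangerouslyAllowAPIKeyInBrowser: true,
  })
);
const wavRecorderRef = useRef(new WavRecorder({ sampleRate: 24000 }));
const wavStreamPlayerRef = useRef(new WavStreamPlayer({ sampleRate: 24000 }));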

This is the code for connecting to the API:

const connectConversation = useCallback(async () => {
  const client = clientRef.current;
  const wavRecorder = wavRecorderRef.current;
  const wavStreamPlayer = wavStreamPlayerRef.current;

  // Start microphone capture and the playback audio context
  await wavRecorder.begin();
  await wavStreamPlayer.connect();

  try {
    const response = await client.connect();
    if (response) {
      setLoading(false);
      client.sendUserMessageContent([{ type: "input_text", text: "Hello!" }]);

      // With server-side VAD, stream mic audio to the API continuously
      if (client.getTurnDetectionType() === "server_vad") {
        await wavRecorder.record((data) => client.appendInputAudio(data.mono));
      }
    }
  } catch (error) {
    console.error("Error connecting:", error);
  }
}, []);

This is the code for handling the streamed response:

client.on("conversation.updated", async ({ item, delta }) => {
  if (item.role === "assistant" && delta?.audio) {
    // Queue the new audio chunk for playback...
    wavStreamPlayer.add16BitPCM(delta.audio, item.id);
    // ...but the transcript ref updates immediately, ahead of playback
    textRef.current = item.formatted.transcript;
  } else if (delta?.text) {
    textRef.current = item.formatted.transcript;
  }

  if (item.status === "completed" && item.formatted.audio?.length) {
    // Decode the accumulated 16-bit PCM (24 kHz) into a playable WAV URL
    const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
    setAudiosrc(wavFile.url);
  }
});
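
Note what happens here: the transcript ref is updated the moment a delta arrives, while the PCM chunk is only queued inside WavStreamPlayer, so the text naturally runs ahead of playback. If your wavtools build is the one from the OpenAI realtime console reference app, WavStreamPlayer also exposes getTrackSampleOffset(), which reports how many samples of the current track have actually been played. A rough sketch of reading the real playback position (the helper name and the 100 ms polling interval are mine):

// Ask the player how far the currently playing track has progressed.
// Assumes wavtools' WavStreamPlayer.getTrackSampleOffset(), as in the realtime console.
const pollPlaybackPosition = async (wavStreamPlayer) => {
  const result = await wavStreamPlayer.getTrackSampleOffset();
  if (!result) return null; // nothing queued/playing yet
  const { trackId, offset } = result;
  return { trackId, playedSeconds: offset / 24000 }; // samples played / sample rate
};

// Example: log every 100 ms which item is playing and how far along it is.
setInterval(async () => {
  const pos = await pollPlaybackPosition(wavStreamPlayerRef.current);
  if (pos) console.log(`item ${pos.trackId}: ${pos.playedSeconds.toFixed(2)}s played`);
}, 100);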

Problem observed

The text cannot be scrolled in sync with the audio playback.

Scrolling logic, based on an estimated duration of 150 words per minute:

const scrollText = () => {
  if (!scrollContainerRef.current) return;

  const container = scrollContainerRef.current;
  const currentTime = Date.now();
  const elapsed = currentTime - scrollStartTimeRef.current;
  const duration = getScrollDuration(text);

  if (elapsed >= duration) {
    container.scrollTop = container.scrollHeight - container.clientHeight;
    return;
  }

  const progress = elapsed / duration;
  const targetScrollTop = container.scrollHeight - container.clientHeight;

  // Smooth easing function for better scrolling
  const easeInOutQuad = (t) =>
    t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;

  container.scrollTop = targetScrollTop * easeInOutQuad(progress);
  animationFrameRef.current = requestAnimationFrame(scrollText);
};

Approach taken

  1. Converting 16-bit PCM into an audio source

     const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
     setAudiosrc(wavFile.url);

     • However, the conversion takes time depending on the length of the
       response, causing desynchronization.

  2. Scrolling based on word count (150 WPM rule)

     const wordsPerMinute = 150;
     const words = text.split(" ").length;
     return (words / wordsPerMinute) * 60 * 1000;

     This works for short responses but fails for longer ones because actual
     speech speed varies (see the sketch after this list).
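
Since the completed item already carries the raw PCM at 24 kHz, the true playback duration can also be derived from the sample count rather than a words-per-minute guess. A rough sketch of that calculation (assuming item.formatted.audio is an Int16Array of mono samples, as the decode call above implies):

// Duration of a 16-bit PCM mono buffer in milliseconds.
// Assumes `pcm` is an Int16Array of mono samples at 24 kHz.
const getAudioDurationMs = (pcm, sampleRate = 24000) => (pcm.length / sampleRate) * 1000;

// e.g. use the real audio length instead of the 150 WPM estimate:
const duration = getAudioDurationMs(item.formatted.audio);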

Questions:

  1. How can I accurately sync the text scroll with the real-time audio
     playback?
  2. Are there any existing libraries or best practices for handling
     text-audio synchronization in real-time applications?

Any insights or suggestions would be greatly appreciated!
