I am using OpenAI’s real-time API (gpt-4o-realtime-preview-2024-12-17) in a React-based application for live transcription and response generation. However, I am facing an issue where the transcribed text and the generated speech output do not align properly. Sometimes the text appears earlier than expected, or the audio plays with a delay.
Implementation details:
- The application uses WebSockets to stream real-time audio to OpenAI.
- I am using the RealtimeClient from OpenAI’s API to send and receive live audio responses.
- WavRecorder and WavStreamPlayer handle audio streaming and playback, since the audio is in 16-bit PCM format.
- The text responses are updated dynamically as they arrive via the API.
This is the code for connecting to the API:
    const connectConversation = useCallback(async () => {
      const client = clientRef.current;
      const wavRecorder = wavRecorderRef.current;
      const wavStreamPlayer = wavStreamPlayerRef.current;

      // Start the microphone and the audio output before connecting
      await wavRecorder.begin();
      await wavStreamPlayer.connect();

      try {
        const response = await client.connect();
        if (response) {
          setLoading(false);
          client.sendUserMessageContent([{ type: "input_text", text: "Hello!" }]);
          // With server-side VAD, stream microphone audio continuously
          if (client.getTurnDetectionType() === "server_vad") {
            await wavRecorder.record((data) => client.appendInputAudio(data.mono));
          }
        }
      } catch (error) {
        console.error("Error connecting:", error);
      }
    }, []);
This is the code for handling the response:
    client.on("conversation.updated", async ({ item, delta }) => {
      if (item.role === "assistant" && delta?.audio) {
        // Queue the audio chunk for playback
        wavStreamPlayer.add16BitPCM(delta.audio, item.id);
        textRef.current = item.formatted.transcript; // Text updates immediately
      } else if (delta?.text) {
        textRef.current = item.formatted.transcript;
      }
      // Once the item is complete, convert the full PCM audio into a WAV source
      if (item.status === "completed" && item.formatted.audio?.length) {
        const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
        setAudiosrc(wavFile.url);
      }
    });
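To make the timing side of this handler concrete: the amount of audio received so far can be measured from the chunks themselves, without waiting for the WAV conversion. The following is only a minimal sketch, assuming the audio is 16-bit mono PCM at 24000 Hz (the rate passed to WavRecorder.decode above) and that each chunk exposes byteLength, as ArrayBuffer and typed arrays do. AudioDurationTracker is a hypothetical helper, not part of the OpenAI client.

```javascript
// Assumption: 16-bit mono PCM at 24000 Hz, matching the decode call above.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2;

// Hypothetical helper: tracks how much audio has been received per item.
class AudioDurationTracker {
  constructor() {
    this.bytesByItem = new Map();
  }

  // chunk may be an ArrayBuffer or a typed array; both expose byteLength.
  add(itemId, chunk) {
    const previous = this.bytesByItem.get(itemId) ?? 0;
    this.bytesByItem.set(itemId, previous + chunk.byteLength);
  }

  // Duration of the audio received so far for this item, in milliseconds.
  durationMs(itemId) {
    const bytes = this.bytesByItem.get(itemId) ?? 0;
    return (bytes / BYTES_PER_SAMPLE / SAMPLE_RATE) * 1000;
  }
}
```

In the handler above, this could be called as tracker.add(item.id, delta.audio) next to add16BitPCM, so that the total duration is known as soon as the last chunk arrives.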
Problem observed:
I couldn’t keep the text scrolling in sync with the audio.
The scrolling logic is based on an estimated duration, assuming 150 words per minute:
    const scrollText = () => {
      if (!scrollContainerRef.current) return;
      const container = scrollContainerRef.current;
      const currentTime = Date.now();
      const elapsed = currentTime - scrollStartTimeRef.current;
      const duration = getScrollDuration(text);

      // Estimated duration has passed: jump to the bottom and stop animating
      if (elapsed >= duration) {
        container.scrollTop = container.scrollHeight - container.clientHeight;
        return;
      }

      const progress = elapsed / duration;
      const targetScrollTop = container.scrollHeight - container.clientHeight;

      // Smooth easing function for better scrolling
      const easeInOutQuad = (t) =>
        t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;

      container.scrollTop = targetScrollTop * easeInOutQuad(progress);
      animationFrameRef.current = requestAnimationFrame(scrollText);
    };
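For comparison, the scroll position could be driven by the actual playback position instead of wall-clock time and easing. This is a minimal sketch under assumptions: playedMs comes from whatever reports playback progress (for example audio.currentTime * 1000 on an audio element), totalMs is the known audio duration, and text position is assumed to advance roughly linearly with audio time. scrollTopForPlayback is a hypothetical helper name.

```javascript
// Hypothetical helper: maps playback progress to a scroll offset in pixels.
function scrollTopForPlayback(playedMs, totalMs, scrollHeight, clientHeight) {
  // Largest valid scrollTop; 0 when the content fits without scrolling.
  const maxScrollTop = Math.max(0, scrollHeight - clientHeight);

  // No usable duration yet (0, negative, or NaN): stay at the top.
  if (!(totalMs > 0)) return 0;

  // Clamp progress to [0, 1] so late or early timestamps cannot overshoot.
  const progress = Math.min(1, Math.max(0, playedMs / totalMs));
  return maxScrollTop * progress;
}
```

Inside scrollText, the result would be assigned to container.scrollTop on each animation frame. Using a linear mapping here avoids the easing curve moving the text ahead of or behind the audio mid-response.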
Approaches taken:
- Converting the 16-bit PCM into an audio source:
    const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
    setAudiosrc(wavFile.url);
  However, the conversion takes time depending on the length of the response, causing desynchronization.
- Scrolling based on word count (150 WPM rule):
    const wordsPerMinute = 150;
    const words = text.split(" ").length;
    return (words / wordsPerMinute) * 60 * 1000;
  This works for short responses but fails for longer responses because the speech speed varies.
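One variation on the word-count rule, shown only as a sketch: use a measured audio duration whenever one is available and fall back to the 150 WPM estimate otherwise. The function name scrollDurationMs and its measuredAudioMs parameter are hypothetical. Splitting on runs of whitespace is an adjustment to the original split(" "), which counts extra entries for double spaces and treats newline-separated words as one.

```javascript
const WORDS_PER_MINUTE = 150; // same rule of thumb as in the post

// Hypothetical helper: prefers a measured audio duration when one is known.
function scrollDurationMs(text, measuredAudioMs) {
  if (Number.isFinite(measuredAudioMs) && measuredAudioMs > 0) {
    return measuredAudioMs;
  }

  // Fallback: estimate from word count, splitting on any whitespace run.
  const trimmed = text.trim();
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return (words / WORDS_PER_MINUTE) * 60 * 1000;
}
```

With this shape, the scroll can start from the estimate and switch to the real duration once the audio length is known.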
Questions:
- How can I accurately sync the text scroll with the real-time audio playback?
- Are there any existing libraries or best practices for handling text-audio synchronization in real-time applications?
Any insights or suggestions would be greatly appreciated!