What is the mechanism behind the Realtime speech-to-speech API? Are the transcript and audio stream pushed in a synchronized manner?

dingyu · October 7, 2024, 9:22pm

Our use case: whenever the AI says a special phrase, e.g. "what’s next is", we should discard the rest of the response, update the prompt, generate a new response, and continue serving the text to our customer.

This works in text mode (streaming), because we can detect the phrase by string comparison and stop as soon as it appears.
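
For illustration, here is a minimal sketch of that kind of text-mode detection. It is not our production code: `stream_until_phrase` is a hypothetical helper, the chunks are assumed to arrive as plain strings, and the comparison is an exact, case-sensitive match. It holds back the last few characters of the buffer so that a phrase split across two chunks is still caught before any of it is shown.

```python
from typing import Iterable, Iterator


def stream_until_phrase(chunks: Iterable[str], phrase: str) -> Iterator[str]:
    """Yield streamed text, stopping just before `phrase` first appears.

    Hypothetical helper: exact, case-sensitive matching is an assumption.
    """
    if not phrase:
        raise ValueError("phrase must be non-empty")

    buffer = ""              # text received but not yet released
    keep = len(phrase) - 1   # tail that could still be the start of the phrase

    for chunk in chunks:
        buffer += chunk
        idx = buffer.find(phrase)
        if idx != -1:
            # Release only the text before the phrase, then stop consuming.
            if idx > 0:
                yield buffer[:idx]
            return
        # Release everything except the tail that might begin the phrase.
        safe = len(buffer) - keep
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]

    # Stream ended without the phrase: flush what is left.
    if buffer:
        yield buffer


# The phrase is split across two chunks here and is still detected.
demo = ["Sure. what's ne", "xt is, the thing"]
print("".join(stream_until_phrase(demo, "what's next is")))  # prints "Sure. "
```

Once the generator returns early, the caller would presumably abandon the upstream request, update the prompt, and start a new response.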

I wonder whether the Realtime API can accommodate this, and how. Are the text and the audio clips streamed in sync, so that we can look at the text and tell whether the current audio clip contains the special phrase?

Thanks
