What is the mechanism behind the realtime speech-to-speech API? Are the transcript and the audio stream pushed in a synchronized manner?

Our use case is this: whenever the AI says a special phrase, e.g. "what's next is", we should discard the rest of the response, update the prompt, generate a new response, and continue serving the text to our customer.

This works in text mode (streaming), because we can detect the phrase by string comparison and stop as soon as we see it.
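For reference, here is a minimal sketch of the kind of text-mode detection I mean. It is not our production code: the function name `stream_until_phrase` and the chunk iterator are made up for illustration, and it assumes an exact, case-sensitive match. It holds back the last few characters of the stream so the phrase is still caught when it is split across two chunks.

```python
from typing import Iterable, Iterator

STOP_PHRASE = "what's next is"  # the special phrase from our use case


def stream_until_phrase(chunks: Iterable[str], phrase: str = STOP_PHRASE) -> Iterator[str]:
    """Yield streamed text up to (not including) the phrase, then stop.

    Hypothetical helper: `chunks` stands in for whatever the text
    streaming client hands back, one string fragment at a time.
    """
    held = ""  # text not yet released because it might be the start of the phrase
    for chunk in chunks:
        held += chunk
        idx = held.find(phrase)  # exact, case-sensitive comparison
        if idx != -1:
            if idx > 0:
                yield held[:idx]  # release the text that precedes the phrase
            return  # drop the phrase and everything after it
        # Anything older than the last len(phrase) - 1 characters
        # can no longer be part of a match, so it is safe to release.
        safe = len(held) - (len(phrase) - 1)
        if safe > 0:
            yield held[:safe]
            held = held[safe:]
    if held:
        yield held  # stream ended without the phrase; release the remainder
```

Once the generator stops, the caller would cancel the in-flight response, update the prompt, and start a new one. The trade-off is that the held-back tail delays output by up to one phrase length.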

I wonder whether the realtime API can accommodate this, and if so, how. Are the text and the audio clips streamed in sync, so that we can look at the text and tell whether the current audio clip contains the special phrase?

Thanks
