Hi!
Using the realtime API via WebRTC, how can we now determine when the AI begins and finishes talking?
We used to use the output_audio_buffer events, but they appear to have been removed.
We can do audio volume analysis, obviously, but we need to account for natural pauses and it’s generally less accurate. Is there a “correct” way to do this with the new API?
Unfortunately that event fires when the audio message has finished streaming its data, not when it has finished playing. You could use it to estimate, but it’s not nearly as useful as the old event.
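If you do go the estimation route, one approach is to sum the duration of the received PCM16 deltas and project when the queued audio will stop playing. A minimal sketch, assuming 24 kHz mono PCM16 (the Realtime API’s default output format); the class and method names are illustrative, not part of any SDK:

```typescript
const SAMPLE_RATE = 24_000;   // samples per second (assumed output format)
const BYTES_PER_SAMPLE = 2;   // PCM16

class PlaybackEstimator {
  private bufferedBytes = 0;
  private playbackStartMs: number | null = null;

  // Call for each audio delta chunk, passing its byte length and a
  // millisecond timestamp (e.g. performance.now()).
  onDelta(chunkBytes: number, nowMs: number): void {
    if (this.playbackStartMs === null) this.playbackStartMs = nowMs;
    this.bufferedBytes += chunkBytes;
  }

  // Estimated millisecond timestamp at which the buffered audio finishes
  // playing, or null if no audio has arrived yet.
  estimatedEndMs(): number | null {
    if (this.playbackStartMs === null) return null;
    const durationMs =
      (this.bufferedBytes / BYTES_PER_SAMPLE / SAMPLE_RATE) * 1000;
    return this.playbackStartMs + durationMs;
  }
}
```

This assumes playback starts as soon as the first delta arrives and is never stalled, which is why it is only an estimate.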
Correct… the .done event signals the end of audio data from gpt-realtime to the socket. You have to implement a playback FIFO queue that takes .delta chunks and queues them. Audio is only done when two things have happened: you have received the .done event from the socket AND you have exhausted the playback queue. Be careful not to assume that either one of those things by itself means you’re done with playback!

By the way, this mechanism is also really important for handling interruptions… The server may have sent you 10 seconds of audio and already sent the .done event. That audio is still playing when the user wants to interrupt, so you have to zero out your local queue of pending audio, not just assume the realtime API can handle it.
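The bookkeeping described above can be sketched like this. It is a minimal illustration, not the actual SDK API: the chunk type and method names are made up, and wiring the events to your socket and audio player is left out.

```typescript
type AudioChunk = Uint8Array;

class PlaybackQueue {
  private queue: AudioChunk[] = [];
  private doneReceived = false;

  // Call for each audio .delta event from the socket.
  enqueue(chunk: AudioChunk): void {
    this.queue.push(chunk);
  }

  // Call on the .done event: the data stream has ended,
  // but playback may still be running.
  markDone(): void {
    this.doneReceived = true;
  }

  // Called by the audio player to pull the next chunk.
  dequeue(): AudioChunk | undefined {
    return this.queue.shift();
  }

  // True only when BOTH conditions hold: the stream ended
  // AND everything queued has been played.
  isFinished(): boolean {
    return this.doneReceived && this.queue.length === 0;
  }

  // User interrupted: drop all pending audio immediately so the
  // assistant stops talking, rather than draining the queue.
  interrupt(): void {
    this.queue = [];
    this.doneReceived = true;
  }
}
```

Keeping the two conditions separate is the point: `markDone()` alone does not make `isFinished()` true while chunks remain queued, and an empty queue alone does not either, since more deltas may still arrive.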
It looks like they’ve started sending the output_audio_buffer events again! They’re still missing from the docs, though. So my problem is “fixed”, but it’d be handy if someone from the team would weigh in.