How to determine if a response audio has finished in Realtime API (WebRTC)

Hi!
Using the realtime API via WebRTC, how can we now determine when the AI begins and finishes talking?
We used to use the output_audio_buffer events, but they appear to have been removed.

We can do audio volume analysis, obviously, but we need to account for natural pauses and it’s generally less accurate. Is there a “correct” way to do this with the new API?


Hi @Tom_Kail

According to the API Reference, you should be getting the response.output_audio.done event from the server when the output audio finishes streaming.
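For WebRTC sessions, server events arrive as JSON on the data channel, so a minimal sketch of watching for that event might look like this (the handler factory and callback names are illustrative, not part of the API):

```javascript
// Hypothetical sketch: build a "message" handler for the session's
// RTCDataChannel that invokes a callback when the server reports the
// output audio stream has finished. Only the event type string comes
// from the API; everything else here is an assumption.
function makeServerEventHandler(onAudioDone) {
  return (message) => {
    const event = JSON.parse(message.data);
    if (event.type === "response.output_audio.done") {
      onAudioDone(event);
    }
  };
}
```

Usage would be something like `dc.addEventListener("message", makeServerEventHandler(e => { /* ... */ }))`, where `dc` is the data channel from your session setup.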


Unfortunately that event fires when the audio message has finished streaming, not when it has finished playing. You could use it to estimate, but it's not nearly as useful as the old event.

Correct… the .done event signals the end of audio data from gpt-realtime to the socket. You have to implement a playback FIFO queue that takes .delta chunks and queues them. Audio is only done when two things have happened: you have received the .done event from the socket AND you have exhausted the playback queue. Be careful not to assume that either of those things by itself means you're done with playback! By the way, this mechanism is also really important for handling interruptions. The server may have sent you 10 seconds of audio and already sent the .done event, so that audio is playing locally when the user wants to interrupt; you have to zero out your local queue of pending audio, not just assume the realtime API can handle it.
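The queue described above can be sketched as a small class; this is a minimal illustration under the assumption that you decode and play each .delta chunk yourself, and all names here are made up for the example:

```javascript
// Sketch of the playback FIFO: chunks from response.output_audio.delta
// go in, the player drains them, and playback is "finished" only when
// the server's .done event has arrived AND the queue is empty.
class PlaybackQueue {
  constructor() {
    this.chunks = [];          // pending decoded audio chunks
    this.doneReceived = false; // server sent response.output_audio.done
  }
  pushDelta(chunk) { this.chunks.push(chunk); }   // on each .delta event
  markDone() { this.doneReceived = true; }        // on the .done event
  nextChunk() { return this.chunks.shift(); }     // player pulls from here
  // Neither condition alone is enough; both must hold.
  playbackFinished() { return this.doneReceived && this.chunks.length === 0; }
  // On user interruption: drop all locally buffered audio immediately.
  interrupt() { this.chunks = []; }
}
```

The key design point is that `playbackFinished()` checks both conditions, and `interrupt()` clears only the local buffer, matching the advice above not to rely on the server to stop audio you have already received.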

It looks like they've started sending the output_audio_buffer events again! They're still missing from the docs, though. So my problem is "fixed", but it'd be handy if someone from the team could weigh in.


The output_audio_buffer.* events were never removed, but they somehow got dropped from the docs. Will fix.


What's the difference between output_audio_buffer.stopped and response.output_audio.done?

We aren't seeing any output_audio_buffer.* events.

Edit: this is on websocket.
We aren't seeing response.audio.done either; we are seeing response.output_audio.done.

On websocket you control audio playout, so there is no output_audio_buffer.stopped event. That event is only sent for SIP and WebRTC.
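Putting the thread together, the "audio finished" signal depends on the transport. Here is a hedged sketch of that decision; the function and parameter names are invented for illustration, and `localQueueEmpty` stands in for whatever playback-queue state your websocket client tracks:

```javascript
// Sketch: choose the completion signal based on transport.
// - WebRTC/SIP: the server plays out audio and emits
//   output_audio_buffer.stopped when playback actually ends.
// - websocket: you own playout, so "finished" means the
//   response.output_audio.done event arrived AND your local
//   playback queue is drained.
function isAudioFinished(transport, event, localQueueEmpty) {
  if (transport === "webrtc" || transport === "sip") {
    return event.type === "output_audio_buffer.stopped";
  }
  // websocket: .done only marks the end of streaming, not playback
  return event.type === "response.output_audio.done" && localQueueEmpty;
}
```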