We rely on the output_audio_buffer.stopped server event to know when the LLM has finished speaking. We use SIP to connect a Twilio call to gpt-realtime, then monitor the call over a WebSocket connection. We’ve noticed that when the LLM has made a tool call, output_audio_buffer.stopped is not received even after it has finished speaking. To test this, we created a dummy tool that sleeps for 10 seconds, giving the LLM plenty of time to finish speaking, and we still did not receive output_audio_buffer.stopped while the tool was running. The event arrives only after we answer the tool call by sending function_call_output to the server, even though the line has already been silent for several seconds by then.
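A minimal sketch of how this ordering can be checked, assuming server events arrive as decoded JSON dicts with a "type" field. StopTracker and its methods are hypothetical names used only for illustration, and the sketch assumes response.function_call_arguments.done is the event that signals the tool call; only output_audio_buffer.stopped is taken from the behavior described above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StopTracker:
    """Hypothetical helper: records when key events are seen so their order can be compared."""

    clock: Callable[[], float] = time.monotonic
    tool_call_seen_at: Optional[float] = None
    tool_output_sent_at: Optional[float] = None
    audio_stopped_at: Optional[float] = None

    def on_server_event(self, event: dict) -> None:
        """Feed every decoded server event from the monitoring connection here."""
        etype = event.get("type")
        if etype == "response.function_call_arguments.done":
            # Assumption: this event marks the model having issued the tool call.
            self.tool_call_seen_at = self.clock()
        elif etype == "output_audio_buffer.stopped":
            self.audio_stopped_at = self.clock()

    def on_tool_output_sent(self) -> None:
        """Call this right after sending the function call output to the server."""
        self.tool_output_sent_at = self.clock()

    def stopped_only_after_tool_output(self) -> bool:
        """True if the stopped event was seen no earlier than the tool output was sent."""
        return (
            self.audio_stopped_at is not None
            and self.tool_output_sent_at is not None
            and self.audio_stopped_at >= self.tool_output_sent_at
        )
```

With the 10-second dummy tool, a tracker like this would show stopped_only_after_tool_output() returning True on every call, with audio_stopped_at roughly 10 seconds after tool_call_seen_at.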
Is this a bug, or is there a different way for us to know when the LLM has finished speaking when a tool call is pending?