We handle incoming calls using the Realtime API over SIP. Because the calls are inbound, we send a greeting with response.create as soon as the session is created.
We hit a bug with that greeting when the end user had their phone on speaker mode. The model was hearing itself and triggering VAD, but ONLY on the greeting. A few seconds later, when the model sent an automatic response.create, there was no issue.
For the config, we use semantic VAD.
We were receiving an input_audio_buffer.speech_started event that triggered VAD (therefore output_audio_buffer.cleared was sent by the server), so the greeting message was cut off and followed by a weird answer from the model (because it had heard itself).
We tried multiple config tricks and none of them solved the issue. Here is what we tested:
- setting turn_detection.interrupt_response = false
- setting turn_detection.eagerness = 'low'
- setting audio.input.noise_reduction = 'far_field'
- waiting for session.created | session.updated to ensure the right configuration before sending response.create
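For anyone who wants to reproduce these tests, here is a rough sketch of the session.update we mean, written in Python. The setting names come from the list above, but the exact nesting (both keys placed under session.audio.input, noise_reduction sent as an object with a type field, and the session type field) is an assumption on my side, and build_session_update is just a hypothetical helper name. Please check the shape against the API reference for your version.

```python
import json


def build_session_update(interrupt_response=False, eagerness="low",
                         noise_reduction="far_field"):
    """Build a session.update event carrying the settings we tested.

    ASSUMPTION: the nesting below (everything under session.audio.input)
    is illustrative and may differ from the payload your API version expects.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",  # assumption: session type field
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "semantic_vad",
                        "eagerness": eagerness,
                        "interrupt_response": interrupt_response,
                    },
                    "noise_reduction": {"type": noise_reduction},
                }
            },
        },
    }


if __name__ == "__main__":
    # Print the JSON text frame that would be sent over the WS connection.
    print(json.dumps(build_session_update(), indent=2))
```

None of these settings changed the behaviour for us, so this is only useful as a starting point for your own tests.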
After this, we were confident it was neither a config issue nor a race condition between the server config and the response.create.
Because the issue only happened on the response.create we triggered at the beginning of the call, and not on the responses automatically triggered by the model after that, we thought it could be related to an internal hardcoded setting/config of response.create. However, when we triggered our own response.create after a few seconds (~5s), it was not interrupted either, which ruled that out.
We concluded it was related to Acoustic Echo Cancellation (AEC) being broken/unstable for the first few seconds of the call.
True or not, we decided to:
- Accept the call with VAD turned off: turn_detection = null
- Send the greeting with response.create on the WS connection immediately (or after the session.created server event)
- Wait for output_audio_buffer.done | output_audio_buffer.stopped with the response_id corresponding to our greeting
- Once either event is received, send session.update with our actual call config (VAD turned on)
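The steps above can be sketched roughly as follows, in Python. This is not our production code, just a minimal illustration under a few assumptions: GreetingGate is a hypothetical class name, send is whatever function writes one JSON text frame to your WS connection, the call is assumed to have been accepted with turn_detection = null already, and the greeting's response_id is assumed to be read from the response.created server event. The event names are the ones listed above.

```python
import json
from typing import Callable, Optional


class GreetingGate:
    """Keeps VAD off until the greeting has finished playing, then enables it."""

    # Event names as listed in the steps above.
    DONE_EVENTS = {"output_audio_buffer.done", "output_audio_buffer.stopped"}

    def __init__(self, send: Callable[[str], None], call_config: dict):
        self.send = send                # hypothetical: sends one text frame on the WS
        self.call_config = call_config  # the actual call config, VAD turned on
        self.greeting_requested = False
        self.greeting_id: Optional[str] = None
        self.vad_enabled = False

    def on_event(self, raw: str) -> None:
        """Feed every server event (raw JSON text) into this method."""
        event = json.loads(raw)
        etype = event.get("type")

        if etype == "session.created" and not self.greeting_requested:
            # Step 2: request the greeting while VAD is still off.
            self.greeting_requested = True
            self.send(json.dumps({"type": "response.create"}))

        elif (etype == "response.created" and self.greeting_requested
              and self.greeting_id is None):
            # Assumption: the first response created after our request is the greeting.
            self.greeting_id = event["response"]["id"]

        elif (etype in self.DONE_EVENTS and not self.vad_enabled
              and self.greeting_id is not None
              and event.get("response_id") == self.greeting_id):
            # Steps 3 and 4: the greeting audio is finished, switch to the real config.
            self.vad_enabled = True
            self.send(json.dumps({"type": "session.update",
                                  "session": self.call_config}))
```

Usage would be something like gate = GreetingGate(ws_send, call_config), then calling gate.on_event(message) for each message received on the WS. Matching on response_id matters because a later response could also emit these events, and we only want to switch the config once, right after the greeting.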
We tested for an entire afternoon and it seems to resolve our issue.
I am open to hearing or providing any feedback, discussing this topic, or hearing about any other working solution that could be more robust.