Hello everybody,
I am working on integrating the OpenAI Realtime API with Twilio.
We noticed that calls coming from A1 (a telephone provider in Austria) were being handled by OpenAI much worse than calls from Magenta, Spusu, or Hot Telecom. A1 is the best provider in Austria, so this didn't make sense from a latency or audio-quality perspective.
I logged all the websocket events (both Twilio and OpenAI) and found that while Hot Telecom, Spusu, and Magenta stream all the audio (including the silent portions), A1 is more efficient and streams only while the person is talking.
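To quantify this, I log the gaps between Twilio media frames. A minimal sketch of what I do (the check_for_gap helper and the frame math are mine, not from the Twilio SDK; Twilio's media.timestamp is milliseconds since the stream started, and G.711 u-law at 8 kHz works out to 8 bytes per millisecond):

import base64
import json

BYTES_PER_MS = 8  # 8 kHz G.711 u-law: one byte per sample

last_end_ms = None

def check_for_gap(twilio_message: str) -> None:
    """Flag discontinuities in the inbound Twilio media stream."""
    global last_end_ms
    data = json.loads(twilio_message)
    if data.get("event") != "media":
        return
    ts_ms = int(data["media"]["timestamp"])
    payload_ms = len(base64.b64decode(data["media"]["payload"])) // BYTES_PER_MS
    if last_end_ms is not None and ts_ms > last_end_ms:
        print(f"gap of {ts_ms - last_end_ms} ms before frame at {ts_ms} ms")
    last_end_ms = ts_ms + payload_ms

On A1 calls this prints long gaps wherever the caller was silent; on the other providers the timestamps are contiguous.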
Currently, our OpenAI Realtime API session is configured with turn_detection=server_vad:
{
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "server_vad"},
        "input_audio_format": "g711_ulaw",
        "output_audio_format": "g711_ulaw",
        "voice": VOICE,
        "instructions": SYSTEM_MESSAGE,
        "modalities": ["text", "audio"],
        "input_audio_transcription": {"model": "whisper-1"},
        "temperature": MODEL_TEMPERATURE,
        "tools": TOOLS,
    },
}
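(As an aside, my understanding from the docs is that server_vad also accepts tuning parameters; we have left them at the defaults so far. The values below are the documented defaults as I understand them, so treat them as an assumption on my part:)

session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # speech probability needed to open a turn (0-1)
            "prefix_padding_ms": 300,    # audio kept from before speech was detected
            "silence_duration_ms": 500,  # trailing silence that closes the turn
        },
    },
}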
I have a couple of questions:
- Do you have an understanding/feeling of how the VAD interacts with the underlying model (does the model receive the full audio, or only the portions that VAD identifies as speech)?
- If I can identify whether a call is coming from A1 (partial audio) or from another provider (full audio, silence included), which strategy do you think would be better?
  a) Add fake silence to the call (see the sketch after this list)
  b) Disable VAD or adjust some other settings
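For (a), the idea would be to pad detected gaps with G.711 u-law silence (0xFF bytes) before appending to the OpenAI input buffer. A rough sketch of what I have in mind, extending the gap-detection sketch above; forward_with_silence_fill is a hypothetical helper, and openai_ws is assumed to be a synchronous websocket with a .send() method (our real code is asyncio):

import base64
import json

ULAW_SILENCE = b"\xff"  # G.711 u-law encoding of zero amplitude
BYTES_PER_MS = 8        # 8 kHz, one byte per sample

last_end_ms = None

def forward_with_silence_fill(twilio_message: str, openai_ws) -> None:
    """Forward a Twilio media frame to OpenAI, padding any gap with silence."""
    global last_end_ms
    data = json.loads(twilio_message)
    if data.get("event") != "media":
        return
    ts_ms = int(data["media"]["timestamp"])
    payload = base64.b64decode(data["media"]["payload"])
    if last_end_ms is not None and ts_ms > last_end_ms:
        # Fill the silent gap A1 skipped over with synthetic u-law silence.
        filler = ULAW_SILENCE * ((ts_ms - last_end_ms) * BYTES_PER_MS)
        openai_ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(filler).decode("ascii"),
        }))
    openai_ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": data["media"]["payload"],
    }))
    last_end_ms = ts_ms + len(payload) // BYTES_PER_MS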
Thank you for your time, and sorry if I am fundamentally misunderstanding how the Realtime system works.