Hi OpenAI Realtime Team,
I’m integrating the Realtime API with SIP trunking (Twilio → OpenAI Realtime SIP → Realtime WebSocket). Everything is working well overall, but I’m running into one issue that’s blocking a large-scale deployment.
Issue
When an incoming call is answered, I wait for the WebSocket connection to fully open and then use response.create to send an initial greeting. However:
- Sometimes the AI repeats the initial greeting.
- Sometimes the initial greeting is delayed.
- If the greeting is delayed and the caller speaks first, the AI sometimes replies in the wrong language (I’ve had responses in Chinese and Spanish).
This inconsistency only happens when the user speaks before the model has delivered the greeting.
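For reference, here is a minimal sketch of my current startup flow. The event shapes (`session.update`, `response.create`) are from the public Realtime API; the instruction text and helper name are just placeholders of mine:

```python
import json

# Placeholder instruction text -- the real prompt is longer.
GREETING_INSTRUCTIONS = "Greet the caller in English before anything else."

def build_startup_events():
    """Build the two client events I send as soon as the WebSocket opens.

    1. session.update -- sets the session instructions.
    2. response.create -- asks the model to speak the greeting without
       waiting for any caller audio.
    """
    session_update = {
        "type": "session.update",
        "session": {"instructions": GREETING_INSTRUCTIONS},
    }
    response_create = {"type": "response.create"}
    return [json.dumps(session_update), json.dumps(response_create)]
```

I send both events back-to-back on the socket's open callback; the problems above appear when caller audio arrives before the greeting response has completed.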
Observation
When I call 1-800-CHAT-GPT, ChatGPT seems to answer instantly—before the caller can speak at all. It feels like the greeting is being triggered in a different way than using a response.create after WebSocket open.
Question
Are you using:
- the same WebSocket flow available through the public Realtime API, or
- some session-level configuration or internal mechanism that lets the model “start speaking first” before user audio is processed?
I’d like to optimize this flow on my end so I can guarantee that:
- The AI always speaks first
- There is no language confusion
- The greeting is never repeated
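For context, here is a sketch of the workaround I’m experimenting with: disabling server-side turn detection at session start so caller audio can’t trigger a turn, then re-enabling it once the greeting’s `response.done` event arrives. The event shapes follow the Realtime API; the sequencing and helper names are my own, and I’m not sure this is the intended approach:

```python
import json

def vad_off_event():
    # Disable server VAD so early caller speech is not treated as a turn.
    return {"type": "session.update", "session": {"turn_detection": None}}

def vad_on_event():
    # Restore server VAD for normal turn-taking after the greeting.
    return {
        "type": "session.update",
        "session": {"turn_detection": {"type": "server_vad"}},
    }

def handle_server_event(event, send):
    """Re-enable VAD only once the greeting response has finished."""
    if event.get("type") == "response.done":
        send(json.dumps(vad_on_event()))
```

If there’s a recommended pattern instead of this kind of manual gating, I’d much rather follow that.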
GPT-Realtime has been fantastic so far—this is the only remaining challenge before I move ahead with broader deployment of our phone-based AI agent.
Any guidance on recommended call-answering architecture or best practices for this first-speak behavior would be greatly appreciated.
Thank you!