SIP Trunking + Realtime API Call Flow — Initial Greeting Delay & Language Mismatch

Hi OpenAI Realtime Team,

I’m integrating the Realtime API with SIP trunking (Twilio → OpenAI Realtime SIP → Realtime WebSocket). Everything is working well overall, but I’m running into one issue that’s blocking me from doing a large-scale deployment.

Issue

When an incoming call is answered, I wait for the WebSocket connection to fully open and then use response.create to send an initial greeting. However:

  1. Sometimes the AI repeats the initial greeting.

  2. Sometimes the initial greeting is delayed.

  3. If the greeting is delayed and the caller speaks first, the AI sometimes replies in the wrong language (I’ve had responses in Chinese and Spanish).

This inconsistency only happens when the user speaks before the model has delivered the greeting.
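For context, my greeting flow looks roughly like this (a simplified sketch, not my full code; the event shapes follow the public Realtime API client events, but the instructions strings and the language value are just illustrations):

```typescript
// Sketch of the greeting flow: on WebSocket open I first send a
// session.update that pins the language, then a response.create
// that triggers the initial greeting.

type RealtimeEvent = { type: string; [key: string]: unknown };

// Build a session.update event that pins the response language
// before any caller audio is processed.
function buildSessionUpdate(language: string): RealtimeEvent {
  return {
    type: "session.update",
    session: {
      instructions: `Always respond in ${language}. Greet the caller first.`,
    },
  };
}

// Build the response.create event that triggers the initial greeting.
function buildGreetingResponse(): RealtimeEvent {
  return {
    type: "response.create",
    response: {
      instructions: "Greet the caller warmly and introduce yourself.",
    },
  };
}

// On WebSocket open, these two events go out in order:
const events = [buildSessionUpdate("English"), buildGreetingResponse()];
console.log(events.map((e) => e.type).join(","));
// → session.update,response.create
```

The problem is that when the caller speaks before the greeting response arrives, this ordering apparently isn't enough to prevent the model from keying off the caller's audio instead.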

Observation

When I call 1-800-CHAT-GPT, ChatGPT seems to answer instantly, before the caller can speak at all. It feels like the greeting is triggered differently than by sending a response.create after the WebSocket opens.

Question

Are you using:

  • The same WebSocket flow available through the public Realtime API?
    or

  • Some session-level configuration or internal mechanism that lets the model “start speaking first” before user audio is processed?

I’d like to properly optimize this flow on my end so I can guarantee:

  • The AI always speaks first

  • There is no language confusion

  • No repeated greetings

GPT-Realtime has been fantastic so far—this is the only remaining challenge before I move ahead with broader deployment of our phone-based AI agent.

Any guidance on recommended call-answering architecture or best practices for this first-speak behavior would be greatly appreciated.

Thank you!

response.create is the right mechanism. You might also try using our SIP API (rather than websockets) for better responsiveness.


Thanks for the response! Just to clarify, I’m using the SIP API (Twilio → OpenAI Realtime SIP integration).

I am only using a WebSocket connection to:

  • Monitor call events (session.created, conversation.item.created, etc.)

  • Send response.create for the initial greeting.

The model is responding much better now! I’m running on Google Cloud Platform (Firebase Functions), and keeping a warm instance of the cloud function (minInstances: 1) sped things up significantly. The initial greeting now consistently goes out before the caller begins speaking.
Documenting this here in case it can help others.
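For anyone else hitting this, the warm-instance setting I mentioned is a one-line option. Here is a minimal sketch (the options object is real Firebase Functions v2 configuration; the region value and the surrounding function name are just examples for my setup):

```typescript
// Firebase Functions v2 options I use to keep one instance warm,
// so the initial greeting isn't delayed by a cold start.
// Note: a warm instance is billed even while idle.
const callHandlerOptions = { minInstances: 1, region: "us-central1" };

// In the actual function file this object is passed to onRequest:
//   import { onRequest } from "firebase-functions/v2/https";
//   export const handleCall = onRequest(callHandlerOptions, handler);

console.log(callHandlerOptions.minInstances);
// → 1
```

With minInstances at its default of 0, the first call after an idle period has to wait for a cold start, which was exactly the window in which callers started speaking before the greeting.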

Makes sense; it sounds like it can take multiple seconds to spin up a cold function. You might also want to look at val.town, which is what I’ve been using for Realtime API demos and which has very fast cold startup.


Sounds good! I have been using your sample code there as a guide (juberti/hello-realtime on Val Town).