SIP Trunking + Realtime API Call Flow — Initial Greeting Delay & Language Mismatch

jdx1 · November 17, 2025, 7:20pm

Hi OpenAI Realtime Team,

I’m integrating the Realtime API with SIP trunking (Twilio → OpenAI Realtime SIP → Realtime WebSocket). Everything is working well overall, but I’m running into one issue that’s blocking me from doing a large-scale deployment.

Issue

When an incoming call is answered, I wait for the WebSocket connection to fully open and then use response.create to send an initial greeting. However:

Sometimes the AI repeats the initial greeting.
Sometimes the initial greeting is delayed.
If the greeting is delayed and the caller speaks first, the AI sometimes replies in the wrong language (I’ve had responses in Chinese and Spanish).

This inconsistency only happens when the user speaks before the model has delivered the greeting.

Observation

When I call 1-800-CHAT-GPT, ChatGPT seems to answer instantly—before the caller can speak at all. It feels like the greeting is being triggered in a different way than using a response.create after WebSocket open.

Question

Are you using:

The same WebSocket flow available through the public Realtime API?
or
Some session-level configuration or internal mechanism that lets the model “start speaking first” before user audio is processed?

I’d like to formally optimize this flow on my end so I can guarantee:

The AI always speaks first
There is no language confusion
No repeated greetings

GPT-Realtime has been fantastic so far—this is the only remaining challenge before I move ahead with broader deployment of our phone-based AI agent.

Any guidance on recommended call-answering architecture or best practices for this first-speak behavior would be greatly appreciated.

Thank you!

juberti · November 17, 2025, 8:19pm

response.create is the right mechanism. You might also try using our SIP API (rather than websockets) for better responsiveness.

jdx1 · November 17, 2025, 11:14pm

Thanks for the response! Just to clarify, I’m using the SIP API (Twilio → OpenAI Realtime SIP integration).

I am only using a WebSocket connection to:

Monitor call events (session.created, conversation.item.created, etc.)
Send response.create for the initial greeting.

The model is responding much better now! I’m running on Google Cloud Platform (Firebase Functions), and keeping a warm instance of the cloud function (minInstances: 1) sped things up significantly. The initial greeting now consistently goes out before the caller begins speaking.
Documenting this here in case it can help others.

juberti · November 18, 2025, 12:12am

Makes sense, sounds like it can take multiple seconds to spin up a cold function. You might also want to look at val.town, which is what I’ve been using for realtime API demos and has very fast cold startup.

jdx1 · November 18, 2025, 12:14am

Sounds good! I have been using your sample code there as a guide juberti/hello-realtime | Val Town

Topic		Replies	Views
Consistently 3–5s (sometimes 7s) INVITE→realtime.call.incoming delay on SIP Realtime; accept is <1s. Any guidance? API api-realtime	4	206	January 20, 2026
Realtime api phone use case - speaking text Feedback assistants-api , realtime	16	2290	November 5, 2024
Realtime API - Unintended Speech Control with Sequential response.create API gpt-realtime	2	191	September 25, 2025
Realtime API using SIP - Call dropping after 1 second - BYE from OpenAI API realtime	2	405	October 15, 2025
Realtime API unreliable over SIP API realtime , sip	9	1048	January 17, 2026

SIP Trunking + Realtime API Call Flow — Initial Greeting Delay & Language Mismatch

Issue

Observation

Question

Related topics