How to manage user silence in Twilio calls? [OpenAI Realtime API]

I’m using the Realtime API with Twilio for outbound calls, and I want to handle cases where the user goes silent: when the user says nothing for a while, the assistant should respond in some way. I’ve got it working so that the assistant asks, “Are you there?” after a set duration. The problem is that the timer also counts the time when the assistant itself is speaking. My code is written in Python using WebSockets.


You can use a single “silence timer” that you restart whenever you receive audio from either Twilio’s media event (the user speaking) or the AI’s response.audio.delta (the AI speaking). If the timer runs out, that indicates no one has spoken at all.

In AnswerPal we use this for inbound calls: if the AI needs to look up information (function calls / searches / etc.), we automatically switch to hold music after a few seconds of silence. As soon as the AI sends new audio, we stop the music immediately.
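
A minimal sketch of that hold-music pattern in Python with asyncio (send_hold_music_frame is a hypothetical helper standing in for however you stream music frames to Twilio; the names and the 3-second threshold are illustrative, not our production code):

import asyncio
import time

HOLD_MUSIC_DELAY = 3.0  # seconds of AI silence before hold music starts

last_ai_audio = time.time()
hold_music_task = None

async def play_hold_music(twilio_ws, stream_sid):
    # Stream hold-music frames to Twilio until this task is cancelled.
    while True:
        await send_hold_music_frame(twilio_ws, stream_sid)  # hypothetical helper
        await asyncio.sleep(0.02)  # roughly one 20 ms frame per iteration

async def hold_music_watchdog(twilio_ws, stream_sid):
    global hold_music_task
    while True:
        await asyncio.sleep(0.5)
        if time.time() - last_ai_audio > HOLD_MUSIC_DELAY and hold_music_task is None:
            hold_music_task = asyncio.create_task(play_hold_music(twilio_ws, stream_sid))

def on_ai_audio_delta():
    # Call this on every response.audio.delta from the model.
    global last_ai_audio, hold_music_task
    last_ai_audio = time.time()
    if hold_music_task is not None:
        hold_music_task.cancel()  # stop the music the moment real audio arrives
        hold_music_task = None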

Practical Steps:

  1. Single global ‘silence timer’
    Start or reset this timer when audio arrives (from Twilio’s media event or the AI’s response.audio.delta).

  2. On each new audio chunk
    Reset the timer—this means there is currently no silence.

  3. When the timer expires (e.g., after 5s)
    No one has spoken during that period, so treat that as genuine silence.

  4. Respond accordingly
    For example, prompt the user (“Are you still there?”) or handle the silence in another way.

This approach captures silence from both the caller and the AI in a single logical flow.
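
A minimal sketch of that flow in Python with asyncio (all names are illustrative):

import asyncio
import time

SILENCE_TIMEOUT_SECONDS = 5.0
last_audio_time = time.time()

def on_audio_chunk():
    # Call this from both Twilio's 'media' handler (caller audio)
    # and the model's response.audio.delta handler (AI audio).
    global last_audio_time
    last_audio_time = time.time()

async def handle_silence():
    print("Silence detected, prompting user...")  # e.g. "Are you still there?"

async def silence_watchdog():
    # One global timer: it only fires when *neither* side has sent audio.
    while True:
        await asyncio.sleep(1)
        if time.time() - last_audio_time > SILENCE_TIMEOUT_SECONDS:
            await handle_silence()
            on_audio_chunk()  # restart the window after prompting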

Thank you for your response. I added a silence timer that resets on response.text.delta (in my case, because I am using ElevenLabs for the audio). The issue is that the text deltas arrive very quickly: the text finishes and response.text.done fires while the assistant is still speaking, so the timer ends up counting the assistant’s own speech.

I have now discovered a Twilio event that sends a mark when audio playback is completed, but it also sends marks before the assistant is done speaking.

                elif data['event'] == 'mark':
                    print("✅ Twilio: finished playing assistant response")
                    reset_user_timer()

                    # Cancel any running silence monitor before starting a new one.
                    if silent_task and not silent_task.done():
                        silent_task.cancel()
                        try:
                            await silent_task
                        except asyncio.CancelledError:
                            print("✅ Old silence monitor task cancelled")

                    # Start a fresh silence monitor task.
                    silent_task = asyncio.create_task(silence_monitor())

                    if mark_queue:
                        mark_queue.pop(0)

    def reset_user_timer():
        nonlocal last_user_speech_timestamp
        last_user_speech_timestamp = time.time()
        print(f"**** time now : {last_user_speech_timestamp:.2f}")

    async def silence_monitor():
        try:
            while True:
                await asyncio.sleep(1)
                elapsed = time.time() - last_user_speech_timestamp
                if elapsed > SILENCE_TIMEOUT_SECONDS:
                    print(f"⏰ Silence detected: {elapsed:.2f}s. Prompting user...")
                    await prompt_user_for_response()
                    reset_user_timer()
        except asyncio.CancelledError:
            print("🛑 silence_monitor task cancelled")

We’ve also seen that response.text.done often arrives before the audio has actually finished playing. This is especially problematic because we don’t want the AI to start listening during the first greeting message when picking up the phone. In our case, callers often interject with a quick “ooh” when they realize it’s an AI, and that can derail the AI’s flow.

As a workaround, we effectively make the AI “deaf” for the first 10 seconds while the greeting message plays. Our greeting typically runs 10–15 seconds, depending on the language and any personal details (e.g., name) we include. Unfortunately, there isn’t currently a more precise way to detect when the audio playback truly ends, since response.text.done just indicates that the text has been generated, not that the audio is fully done playing.
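
A sketch of that “deaf period” in Python (10 seconds is just our tuned value; adjust it to your greeting length):

import time

GREETING_DEAF_SECONDS = 10.0
call_started_at = None

def on_call_start():
    # Call this when Twilio's 'start' event arrives.
    global call_started_at
    call_started_at = time.time()

def should_forward_caller_audio():
    # Drop caller audio while the greeting is (probably) still playing,
    # so an early "ooh" can't interrupt or derail the AI.
    return time.time() - call_started_at > GREETING_DEAF_SECONDS

# In the Twilio 'media' handler (forward_to_model is a hypothetical helper):
#     if should_forward_caller_audio():
#         await forward_to_model(payload)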

We also play a greeting audio (in our case, the assistant greets very late). How does your assistant handle greetings, given that it has no memory of the greeting message? When a user says hello again, ours greets once more. Where do you provide the greeting audio?
For the silence handling, do you know of any logic we can implement on our own?

Here’s how we do it:


// 2) Immediately send the greeting as a "user" message
const greetEvent = {
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_text",
        text: `Greet the caller in their own language, using a time-of-day awareness. For example, if it is morning in their local time zone, say "Good morning," if it is afternoon, say "Good afternoon," and so on (even if the language is not English, adapt accordingly). If you know the caller's name from the EndUser information, address them by name, e.g. "Good afternoon, Thierry De Decker, welcome to iPower. I am AnswerPal, your 24 on 7 digital assistant. How can I help you today?" If no name is available, greet them in a polite, friendly manner. Mention that this conversation is being recorded for quality assurance purposes. Never repeat the greeting.`,
      },
    ],
  },
};
JsonSend(session.modelConn, greetEvent);
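
For a Python/websockets setup like the one in the original question, the equivalent would look roughly like this (a sketch: openai_ws is assumed to be an already-connected Realtime API WebSocket, the greeting text is abbreviated from the prompt above, and the trailing response.create is what actually makes the model speak):

import json

async def send_greeting(openai_ws):
    # Inject the greeting instruction as a user message...
    await openai_ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{
                "type": "input_text",
                "text": "Greet the caller politely, mention the call is recorded, and never repeat the greeting.",
            }],
        },
    }))
    # ...then ask the model to respond to it.
    await openai_ws.send(json.dumps({"type": "response.create"}))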

We explicitly include instructions not to repeat the greeting, yet it still happens occasionally. In our experience, this repetition is often triggered by loud background noises or partial/unclear user speech. The AI thinks it has to greet again because it interprets those sounds as the user prompting another greeting. It also happens if callers jump in with random keywords that the AI can’t interpret properly—so it falls back to the greeting logic.

In practice, it works best when users speak to the AI the same way they would to a real human conversation partner. That tends to reduce misinterpretations and repeated greetings, but there’s unfortunately no foolproof way to prevent the AI from re-triggering the greeting if it detects ambiguous input.
