Interruption not implemented out of the box in the Twilio Example

I’m thrilled to start my implementation journey! As soon as I saw the update on X, I immediately checked out the Twilio tutorial.

I was able to get a POC up and running quickly!

However, the example doesn’t cover how to handle interruptions, such as cutting off the LLM’s audio output simply by speaking, the way Advanced Voice Mode does. I’ve spent the last four hours working on my own solution (and I’ll keep at it), but I’d appreciate it if anyone could review the tutorial’s source code and suggest approaches or modifications to make interruptions work. Thanks!


The actual code for the Realtime API gives a clue here. It looks like the interruption doesn’t happen automatically: you get told that an interruption is occurring, and then you have to cancel the current generation yourself.

The Realtime API reference client has a cancelResponse() method that shows how to cancel the current generation.
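Over a raw WebSocket this boils down to sending a response.cancel event, something like this (untested sketch; openaiWs is just a placeholder for whatever socket your server opens to OpenAI):

// Sketch: ask the Realtime API to stop the response it is currently generating
openaiWs.send(JSON.stringify({ type: 'response.cancel' }));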


I’ve attempted both client methods at various server event points, along with some custom in-memory flags like userTalking: boolean, but no dice. I’ll sleep on it and attempt again first thing in the AM. My hunch is it MAY be a Twilio limitation (will look into their bidirectional streaming).

I appreciate your help.

When debugging, I noticed that the server’s audio response is in PCM16 format, which is different from the g711_ulaw we initialized in the WebSocket. Do you think this could be related to the issue?
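For reference, by “initialized” I mean the session.update sent when the OpenAI socket opens, roughly like this (sketch only; field names per the Realtime API session docs, other fields omitted):

// Sketch of the session.update the example sends on connect: both formats
// are set to g711_ulaw so audio can pass through to Twilio Media Streams directly
openaiWs.send(JSON.stringify({
  type: 'session.update',
  session: {
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
  },
}));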


No idea, my friend. But hey, I have all weekend to figure this out. I’m personally shocked Twilio would announce this partnership and fail to deliver on what feels like the base case. :man_shrugging:

I’ve been busy the last couple of days, but back when tts-1 and Whisper were released, I built my own production-ready conversation feature. While building it, I had to work with the threshold for what counts as silence.

This was challenging, since someone in a noisy environment (let’s say a café) would seem to never stop talking. So there are certainly ways to solve this, but they are not the easiest.

What I’m getting at (and you can test this in the playground) is that I think there is a parameter on the realtime endpoint for the silence threshold. Modifying that might yield better results. Then again, I haven’t had time to explore much given how much I’ve been coding at work. Maybe if I’m not too tired on Sunday I’ll check it out and be better informed.
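If someone wants to try it before I do, it should look something like this in the session.update (untested sketch; the numbers are illustrative guesses, not recommendations):

// Sketch: tune server-side VAD so background noise (the café case)
// is less likely to count as speech
openaiWs.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.6,            // higher = less sensitive to quiet sounds
      prefix_padding_ms: 300,    // audio kept from just before speech starts
      silence_duration_ms: 700,  // how long silence must last to end a turn
    },
  },
}));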

Finally, it works perfectly:

case 'input_audio_buffer.speech_started':
  console.log('Speech Start:', response.type);
  // Flush any audio Twilio has already queued for playback
  twilioWs.send(
    JSON.stringify({
      streamSid: streamSid,
      event: 'clear',
    })
  );
  console.log('Cancelling AI speech from the server');
  // Tell OpenAI to stop the in-progress generation
  const interruptMessage = {
    type: 'response.cancel',
  };
  openaiWs.send(JSON.stringify(interruptMessage));
  break;

Edit: You should look at this PR, because you need to manage interrupt handling on both sides, Twilio and OpenAI. When the user speaks and OpenAI sends input_audio_buffer.speech_started, the code in the PR clears the Twilio Media Streams buffer and sends conversation.item.truncate to OpenAI, which is very important in this case.
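Roughly, the truncate event looks like this (sketch only; lastAssistantItemId and elapsedPlaybackMs are placeholder names for state your server has to track itself, e.g. from response.audio.delta item IDs and Twilio media timestamps):

// Sketch: truncate the assistant's last audio item at the point the caller
// actually heard, so the conversation history matches what was played
openaiWs.send(JSON.stringify({
  type: 'conversation.item.truncate',
  item_id: lastAssistantItemId,   // placeholder: ID of the assistant's current item
  content_index: 0,
  audio_end_ms: elapsedPlaybackMs // placeholder: how much audio was actually played
}));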


Amazing :star_struck: I was just trying to fix this as well. Here’s your code in Python:


if response['type'] == 'input_audio_buffer.speech_started':
    print('Speech Start:', response['type'])
    
    # Send clear event to Twilio
    await websocket.send_json({
        "streamSid": stream_sid,
        "event": "clear"
    })
    
    print('Cancelling AI speech from the server')
    
    # Send cancel message to OpenAI
    interrupt_message = {
        "type": "response.cancel"
    }
    await openai_ws.send(json.dumps(interrupt_message))

Works great!

@seagermack
Amazing find! I was stuck with this issue too after integrating custom RAG into the Realtime API.

Were you able to figure out how to get the custom RAG working with the Realtime API / Twilio project? I am itching to get that figured out for myself!

Dude, this worked, thank you! I’ve been trying to figure out how to interrupt Twilio audio like this for a week now! :sweat_smile:

May I ask where in the Twilio example code this would need to be inserted? I assume it would be in the following section of the code?

// Listen for messages from the OpenAI WebSocket (and send to Twilio if necessary)
openAiWs.on('message', (data) => {
    try {
        const response = JSON.parse(data);

        if (LOG_EVENT_TYPES.includes(response.type)) {
            console.log(`Received event: ${response.type}`, response);
        }

Thanks in advance.


Has anyone been able to figure out function calling in an example with Twilio? Because I have not. Adding function calling seems to keep breaking the script.

Format it exactly like this:

  "tools": [
    {
      "type": "function",
      "name": "xxx",
      "description": "xxx",
      "parameters": {
        "type": "object",
        "properties": {
          "xxx": {
            "type": "string"
          }
        },
        "required": [
          "xxx"
        ]
      }
    }
  ]
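That tools array goes inside the session object you send with session.update (e.g. session: { tools: [...], tool_choice: 'auto', ... }), and the completed call comes back as a function_call output item. Roughly like this (untested sketch; handleXxx is a made-up placeholder for your own function, and the surrounding message handler is assumed to be async):

case 'response.done': {
  // Sketch: look for function_call items in the finished response
  for (const item of response.response?.output ?? []) {
    if (item.type !== 'function_call') continue;
    const args = JSON.parse(item.arguments);
    const result = await handleXxx(args); // placeholder for your own logic
    // Send the result back to the model...
    openAiWs.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: item.call_id,
        output: JSON.stringify(result),
      },
    }));
    // ...and ask it to respond using that result
    openAiWs.send(JSON.stringify({ type: 'response.create' }));
  }
  break;
}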

Hey, is there a reason you clear the Twilio stream before sending the cancel message to the Realtime API?

The current implementation sends the entire audio from OpenAI to Twilio immediately, placing it in a queue for playback. As a result, there isn’t a way to cancel the playback during an interruption, because the audio has already been sent to Twilio. Clearing the Twilio stream before sending the cancel message to the Realtime API ensures that any queued audio playback is stopped before the new instruction is processed, especially because the OpenAI Realtime API generates the response much faster than Twilio plays it back. By the way, sending response.cancel is not required when using server_vad.
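In other words, with server_vad enabled the handler can be as small as this (sketch):

case 'input_audio_buffer.speech_started':
  // Per the note above: with server_vad, OpenAI stops the response on its own,
  // so only Twilio's queued playback needs to be flushed here
  twilioWs.send(JSON.stringify({
    streamSid: streamSid,
    event: 'clear',
  }));
  break;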


I saw this too; not sure why it overrides ulaw to PCM.