Need help being able to interrupt the Realtime API response

sleep0holic · October 9, 2024, 7:05am

I am working on a customer service bot that uses the Realtime API, with Twilio handling the calls. I have been able to sort out most issues, but I am stumped as to why I can’t interrupt the AI. If I ask it something mid-response, it keeps talking over me, waits until it reaches the end of its sentence before acknowledging what I said, and then carries on without stopping. I have tried response.cancel and output_audio_buffer.clear, but no dice. Does anyone have an idea how to go about this?

Just an FYI I am doing this on Python.
Here is an example of how I am sending the commands:

clear_audio_buffer = {"type": "output_audio_buffer.clear"}
await openai_ws.send(json.dumps(clear_audio_buffer))

Edit: I found a solution!

if response["type"] == "input_audio_buffer.speech_started":
    print('Speech Start:', response['type'])
    # Clear Twilio buffer
    clear_twilio = {
        "streamSid": stream_sid,
        "event": "clear"
    }
    await websocket.send_json(clear_twilio)
    print('Cleared Twilio buffer.')

    # Send interrupt message to OpenAI
    interrupt_message = {
        "type": "response.cancel"
    }
    await openai_ws.send(json.dumps(interrupt_message))
    print('Cancelling AI speech from the server.')

aaron.lutz · October 9, 2024, 8:06am

Hi There,

I’m also using the RealTimeAPI in Python (though not with twilio, at least not yet) and the interruptions work. Did you specify the server_vad for turn detection in the session?

sleep0holic · October 9, 2024, 2:53pm

Yep, I have it set to server_vad. Could you share what your session update looks like? I’m pretty stumped on this, honestly.
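
For readers following along, here is a minimal sketch of a session update that enables server-side VAD. It is not aaron.lutz’s actual configuration, which is not shown in this thread. The helper name build_session_update is made up for illustration, and the default values are only placeholders; the event type and the turn_detection fields are the ones the Realtime API uses for server VAD.

```python
import json


def build_session_update(threshold=0.5, prefix_padding_ms=300, silence_duration_ms=500):
    """Build a session.update event that turns on server-side VAD.

    A higher threshold makes the VAD less sensitive to quiet background noise.
    The defaults here are illustrative placeholders, not recommendations.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }


# Serialize the event the same way the other snippets in this thread do.
payload = json.dumps(build_session_update(threshold=0.7))
```

The payload would then be sent with something like `await openai_ws.send(payload)`, where openai_ws is the already-open websocket from the posts above.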

guicalabria · October 9, 2024, 6:44pm

Where did you find information about this stuff? Their Git repo?

sleep0holic · October 9, 2024, 11:27pm

sleep0holic · October 10, 2024, 5:54am

To whoever is reading this in the future, I found a solution. Here is how I implemented it into my code:

if response["type"] == "input_audio_buffer.speech_started":
    print('Speech Start:', response['type'])
    # Clear Twilio buffer
    clear_twilio = {
        "streamSid": stream_sid,
        "event": "clear"
    }
    await websocket.send_json(clear_twilio)
    print('Cleared Twilio buffer.')

    # Send interrupt message to OpenAI
    interrupt_message = {
        "type": "response.cancel"
    }
    await openai_ws.send(json.dumps(interrupt_message))
    print('Cancelling AI speech from the server.')

guicalabria · October 10, 2024, 6:04am

Thanks for sharing, dude!!! Cheers!

mcpower2 · October 10, 2024, 10:01pm

Thanks for sharing, but this works too well for me. I have a fan running in the room, and sometimes it treats the fan noise as an interruption. Is there a way to adjust the sensitivity?

jhakulin · October 12, 2024, 6:25pm

I have a Python implementation that works pretty well, but I cannot interrupt the assistant even when I send response.cancel while the output audio is streaming.

Here is my test scenario:

I am using client-side VAD (I also tested server VAD and saw the same problem). It detects speech, and if I start speaking while the AI is outputting audio, a “response.cancel” event is sent to the OpenAI service.
At the start, I tell the AI to count to 20.
The AI starts to count, and audio playback happens smoothly.
I say “please stop counting”, and a “response.cancel” event is sent to the service.

Problem:

The AI does not cancel the response; it always completes it (i.e. it counts all the way to the end).

Here is part of the trace for more details:

INFO:main:Session updated: {'id': 'sess_AHaqRB3pQQwCdnY5XCHKZ', 'object': 'realtime.session', 'model': 'gpt-4o-realtime-preview', 'expires_at': 1728757471, 'modalities': ['text', 'audio'], 'instructions': 'You are a helpful assistant. Respond concisely. If user asks to tell story, tell story very shortly.', 'voice': 'alloy', 'turn_detection': None, 'input_audio_format': 'pcm16', 'output_audio_format': 'pcm16', 'input_audio_transcription': {'model': 'whisper-1'}, 'tool_choice': 'auto', 'temperature': 0.8, 'max_response_output_tokens': 'inf', 'tools': [{'name': 'get_weather', 'description': 'Get the current weather for a location.', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string'}}, 'required': ['location']}, 'type': 'function'}]}
INFO:vad:Speech started
INFO:main:Speech has started.
INFO:main:Sending audio data to the client.

… I said “Please count to 20”

INFO:main:New Conversation Item: {'id': 'item_AHaqWiaRzvSTacllhd5u8', 'object': 'realtime.item', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}
INFO:main:New Part Added: {'type': 'audio', 'transcript': ''}
INFO:main:Transcript Delta: Sure
INFO:main:Transcript Delta: ,
INFO:main:Transcript Delta: here
INFO:main:Transcript Delta: you
INFO:main:Transcript Delta: go
INFO:main:Transcript Delta: :
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: One
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: ,
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: two

… I try to stop the assistant by saying “Please stop”

INFO:vad:Speech started
INFO:main:Speech has started.
INFO:main:User started speaking while audio is playing.
INFO:main:Clearing input audio buffer.
INFO:main:Cancelling response.
INFO:main:Truncate the current audio, current item ID: item_AHaqWiaRzvSTacllhd5u8, current audio content index: 0
CRITICAL:openai_realtime_common.web_socket_manager:Sending message: {"type": "response.cancel"}
INFO:main:Sending audio data to the client.
INFO:main:Sending audio data to the client.
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Sending audio data to the client.
INFO:main:Sending audio data to the client.
…
INFO:main:Transcript Delta: eight
…

… I have said “Please stop!”

INFO:vad:Speech ended
INFO:main:Speech has ended
INFO:main:Requesting the client to generate a response.

…

INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: eleven

… Assistant just continues

INFO:main:Audio done for response ID resp_AHaqWXORK0F9uAdKQAd9i, item ID item_AHaqWiaRzvSTacllhd5u8
INFO:main:Audio transcript done: 'Sure, here you go: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty.' for response ID resp_AHaqWXORK0F9uAdKQAd9i
INFO:main:Content part done: '' of type 'audio' for response ID resp_AHaqWXORK0F9uAdKQAd9i
INFO:main:Output item done for response ID resp_AHaqWXORK0F9uAdKQAd9i with content: [{'type': 'audio', 'transcript': 'Sure, here you go: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty.'}]
INFO:main:Response completed with status 'completed' and ID 'resp_AHaqWXORK0F9uAdKQAd9i'

… Later the assistant says it will stop, but this comes too late

INFO:main:New Conversation Item: {'id': 'item_AHaqdkiBS12XzLRE39wTo', 'object': 'realtime.item', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}
INFO:main:New Part Added: {'type': 'audio', 'transcript': ''}
INFO:main:Transcript Delta: Alright
INFO:main:Transcript Delta: ,
INFO:main:Transcript Delta: I’ll
INFO:main:Transcript Delta: stop
INFO:main:Transcript Delta: .

Questions:

The content index of the received audio deltas is always 0. Is the assistant able to interrupt the audio of the current content index?
Sometimes I receive response.done with a cancelled status, so I am pretty sure the “response.cancel” event reaches the service.

I would appreciate any suggestions for fixing this; I would like to be able to interrupt the assistant so the conversation feels more alive.

aaron.lutz · October 14, 2024, 10:21am

Yes, you can set the server-side VAD threshold for the API. You can also adjust the input audio levels with whatever audio library you use, to compensate for a noisy environment (this can be done dynamically, based on the baseline audio level of your input device, which changes with your environment). The only problem I’m facing now is the assistant interrupting and answering itself when I’m not using headphones. I’ve been trying to implement some sort of acoustic echo cancellation in Python, but have had no success so far. Does anybody have a solution for this?
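
One way to read the “dynamic, based on the baseline audio level” idea is a simple client-side noise gate that tracks a running noise floor. The sketch below is an assumption about how that could look, not aaron.lutz’s code: the NoiseGate class, its margin and smoothing parameters, and the initial floor are all made up for illustration, and it assumes little-endian 16-bit mono PCM input.

```python
import math
import struct


def rms(pcm16_bytes):
    """Root-mean-square level of little-endian 16-bit PCM audio."""
    usable = len(pcm16_bytes) - (len(pcm16_bytes) % 2)  # ignore a trailing odd byte
    if usable == 0:
        return 0.0
    samples = [s[0] for s in struct.iter_unpack("<h", pcm16_bytes[:usable])]
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class NoiseGate:
    """Hypothetical gate: a chunk counts as speech if it is well above the noise floor."""

    def __init__(self, margin=3.0, smoothing=0.05, initial_floor=100.0):
        self.margin = margin          # how many times louder than the floor speech must be
        self.smoothing = smoothing    # how quickly the floor follows the environment
        self.floor = initial_floor    # running estimate of the background level

    def is_speech(self, pcm16_bytes):
        level = rms(pcm16_bytes)
        if level > self.floor * self.margin:
            return True
        # Quiet chunk: fold it into the noise-floor estimate (exponential moving average).
        self.floor = (1.0 - self.smoothing) * self.floor + self.smoothing * level
        return False
```

A possible use is to replace chunks the gate rejects with silence before appending them to the input buffer, so steady fan noise never reaches the VAD. Whether that is enough depends on the environment, and the margin would need tuning.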

Foxalabs · October 14, 2024, 10:52am

Hi and welcome to the forum!

This seems like a case where some noise reduction/removal algo is required.

The kind of thing Dolby used to do, and perhaps still does. I’m sure a quick look on GitHub will turn up a vast number of background noise suppression algorithms, many of them real-time: the kind of thing that Discord, Zoom, Teams and Google Meet use.

I agree it would be nice to have that as an option on the endpoint, but for now at least, it seems the API requires a clean audio source as a prerequisite.

keanu1 · December 15, 2024, 9:48pm

I’ll also note, for people doing this in Node.js, that the above solution worked for me. Note that it didn’t work with serverVad enabled, so removing or disabling it should resolve this problem.

// Cancel the assistant's response if the user is speaking
if (response.type === "input_audio_buffer.speech_started") {
  const clearTwilio = {
    streamSid: streamSid,
    event: "clear",
  };
  webSocketConnection.send(JSON.stringify(clearTwilio));
  const interruptMessage = {
    type: "response.cancel",
  };
  openaiWebsocket.send(JSON.stringify(interruptMessage));
}

luka.bronzovic · February 10, 2025, 5:51pm

Have you managed to prevent the response loop (Response to the response to …) happening when using only speakers?

aaron.lutz · February 11, 2025, 10:33pm

Yes. You basically need to make sure that the mic does not pick up the output audio, i.e. the audio from the speakers. I used WebRTC in my frontend to do this, since it has built-in AEC. I think their new WebRTC connection already handles that, though, if you want to use it.

luka.bronzovic · February 12, 2025, 5:19am

Thanks, for the suggestion.
So your setup is:
WebBrowser - (WebRTC) - App - (WebSocket) - OpenAI?

aaron.lutz · February 17, 2025, 1:46pm

Yep, exactly. My setup is kind of weird and I use a hybrid approach:

server-side backend: communication via websocket with OpenAI RealTimeAPI
client-side backend: communication with my server, also via websocket, basically forwards all events
local frontend (packaged with Electron, but basically the same as a web app): handles audio recording and playback using WebRTC, which has acoustic echo cancellation and filters the audio output out of the audio input (as long as the output goes through the same browser). It then sends the audio data through the local backend to my server backend and on to OpenAI. In short, the echo cancellation happens in the web browser, which is where the audio recording and playback happen.

hope that helps.

shariqarif19 · February 19, 2025, 7:36pm

The simplest solution can be achieved by sending the input_audio_buffer.append event with a random audio buffer. I got it working in serverVad → createResponse: false mode.
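
The exact payload shariqarif19 used is not shown in the thread, so the following is only a guess at its general shape. An input_audio_buffer.append event carries base64-encoded audio in its audio field; the helper name build_append_event and the 4800-byte size are made up for illustration.

```python
import base64
import json
import os


def build_append_event(num_bytes=4800):
    """Build an input_audio_buffer.append event filled with random bytes.

    Random bytes decode as loud noise when read as PCM16. Whether the server
    VAD treats that as speech is an assumption, not something confirmed here.
    """
    noise = os.urandom(num_bytes)
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(noise).decode("ascii"),
    }


payload = json.dumps(build_append_event())
```

The payload would be sent over the same websocket as the other events in this thread, for example `await openai_ws.send(payload)`.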

yash_mirai · March 19, 2025, 1:14pm

Hey @jhakulin, did you find a solution for this? I’m trying to achieve something similar and am running into the same issue.

jhakulin · March 21, 2025, 4:24am

Yes. The issue was client-side buffering of the playback audio. The audio arrives faster than real time, and when the interrupt was requested I did not flush the playback audio that had already collected in the client-side buffer, so the interruption had no audible effect.
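
A minimal sketch of that fix, assuming a simple queue between the websocket reader and the audio player. This is not jhakulin’s actual code: the PlaybackBuffer class and build_truncate_event helper are hypothetical, and the byte-to-millisecond conversion assumes 24 kHz mono PCM16 output (it would differ for other formats, such as the 8 kHz audio used on Twilio calls). The conversation.item.truncate event mirrors the “Truncate the current audio” step in the trace above.

```python
import collections
import threading

# Assumption: 24 kHz, 16-bit mono PCM output, i.e. 48 bytes per millisecond.
BYTES_PER_MS = 24000 * 2 // 1000


class PlaybackBuffer:
    """Hypothetical client-side queue of audio chunks waiting to be played."""

    def __init__(self):
        self._chunks = collections.deque()
        self._lock = threading.Lock()
        self._played_bytes = 0

    def push(self, pcm16_bytes):
        """Called for every audio delta received from the server."""
        with self._lock:
            self._chunks.append(pcm16_bytes)

    def pop(self):
        """Called by the player; returns the next chunk, or None if empty."""
        with self._lock:
            if not self._chunks:
                return None
            chunk = self._chunks.popleft()
            self._played_bytes += len(chunk)
            return chunk

    def flush(self):
        """Drop everything not yet played; return milliseconds already played."""
        with self._lock:
            self._chunks.clear()
            played_ms = self._played_bytes // BYTES_PER_MS
            self._played_bytes = 0
            return played_ms


def build_truncate_event(item_id, audio_end_ms, content_index=0):
    """Tell the server how much of the assistant's audio the user actually heard."""
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": audio_end_ms,
    }
```

On an interruption, the client would call flush() right away so playback stops, then send response.cancel followed by the truncate event built from the returned value.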

sagarchawla83 · March 27, 2025, 8:46pm

Hey, I tried this and it did not work for me.
Can you share the exact event body? That would be really helpful.
