Need help being able to interrupt the Realtime API response

I am working on a customer service bot using the Realtime API, with Twilio handling the calls. Most things work, but I am stumped as to why I can’t interrupt the AI. If I ask it something mid-response, it keeps talking over me, waits until the end of its sentence before acknowledging what I said, and then carries on without stopping. I have tried sending response.cancel and output_audio_buffer.clear, but no dice. Does anyone have an idea how to go about this?

Just an FYI, I am doing this in Python.
Here is an example of how I am sending the commands:

clear_audio_buffer = {"type": "output_audio_buffer.clear"}
await openai_ws.send(json.dumps(clear_audio_buffer))

Edit: I found a solution!

                    if response["type"] == "input_audio_buffer.speech_started":
                        print('Speech Start:', response['type'])
                        # Clear Twilio buffer
                        clear_twilio = {
                            "streamSid": stream_sid,
                            "event": "clear"
                        }
                        await websocket.send_json(clear_twilio)
                        print('Cleared Twilio buffer.')
                        
                        # Send interrupt message to OpenAI
                        interrupt_message = {
                            "type": "response.cancel"
                        }
                        await openai_ws.send(json.dumps(interrupt_message))
                        print('Cancelling AI speech from the server.')

Hi There,

I’m also using the Realtime API in Python (though not with Twilio, at least not yet), and interruptions work. Did you specify server_vad for turn detection in the session?

Yep, I have turn detection set to server_vad. Could you share what your session update looks like? Pretty stumped on this, honestly.
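For readers wondering what such a session update might look like, here is a minimal sketch, not the setup of anyone in this thread. It builds a session.update event that turns on server-side VAD. The helper name build_session_update and the default values are assumptions for illustration; check the field names against the current Realtime API reference before relying on them.

```python
import json


def build_session_update(threshold=0.5, prefix_padding_ms=300,
                         silence_duration_ms=500, audio_format="pcm16"):
    """Build a session.update event that enables server-side VAD.

    Hypothetical helper; the defaults are illustrative, not recommendations.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # Higher threshold = less sensitive to quiet sounds.
                "threshold": threshold,
                # Audio kept from just before speech was detected.
                "prefix_padding_ms": prefix_padding_ms,
                # Silence needed before the turn is considered finished.
                "silence_duration_ms": silence_duration_ms,
            },
            "input_audio_format": audio_format,
            "output_audio_format": audio_format,
        },
    }


payload = json.dumps(build_session_update())
# Sending is left to the caller, e.g.: await openai_ws.send(payload)
```

With Twilio Media Streams, the audio format would likely need to be "g711_ulaw" rather than "pcm16", since Twilio streams 8 kHz mu-law audio.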

Where do you find information about this stuff? Their Git repo?



To whoever is reading this in the future, I found a solution. Here is how I implemented it into my code:

                if response["type"] == "input_audio_buffer.speech_started":
                    print('Speech Start:', response['type'])
                    # Clear Twilio buffer
                    clear_twilio = {
                        "streamSid": stream_sid,
                        "event": "clear"
                    }
                    await websocket.send_json(clear_twilio)
                    print('Cleared Twilio buffer.')
                    
                    # Send interrupt message to OpenAI
                    interrupt_message = {
                        "type": "response.cancel"
                    }
                    await openai_ws.send(json.dumps(interrupt_message))
                    print('Cancelling AI speech from the server.')
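The snippet above lives inside a larger receive loop. As a self-contained sketch of the same idea, the handler below reacts to input_audio_buffer.speech_started by clearing Twilio's buffered audio and cancelling the OpenAI response. The function name handle_openai_event, the _RecordingSocket stand-in, and the stream SID "MZ123" are all hypothetical and exist only so the example runs without a network connection.

```python
import asyncio
import json


async def handle_openai_event(event, twilio_ws, openai_ws, stream_sid):
    """On user speech, flush Twilio's queued audio and cancel the response.

    Returns True if an interruption was handled, False otherwise.
    """
    if event.get("type") != "input_audio_buffer.speech_started":
        return False
    # Twilio Media Streams: "clear" drops audio already buffered for playback.
    await twilio_ws.send_json({"event": "clear", "streamSid": stream_sid})
    # Realtime API: stop generating the in-progress response.
    await openai_ws.send(json.dumps({"type": "response.cancel"}))
    return True


class _RecordingSocket:
    """Hypothetical stand-in for a real websocket; records what is sent."""

    def __init__(self):
        self.sent = []

    async def send_json(self, data):
        self.sent.append(data)

    async def send(self, text):
        self.sent.append(json.loads(text))


async def _demo():
    twilio_ws, openai_ws = _RecordingSocket(), _RecordingSocket()
    handled = await handle_openai_event(
        {"type": "input_audio_buffer.speech_started"},
        twilio_ws, openai_ws, "MZ123",
    )
    return handled, twilio_ws.sent, openai_ws.sent


handled, twilio_sent, openai_sent = asyncio.run(_demo())
```

Clearing Twilio first matters because audio that Twilio has already buffered keeps playing to the caller even after the API stops generating.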

Thanks for sharing, dude!!! Cheers!


Thanks for sharing, but this works too well for me. I have a fan running in the room, and sometimes the fan noise is treated as an interruption. Is there a way to adjust the sensitivity?

I have a Python implementation that works pretty well, but I cannot interrupt the assistant even when I send response.cancel while the output audio is streaming.

Here is my test scenario:

  • I am using client-side VAD (I also tested with server VAD and saw the same problem). It detects speech, and if I start speaking while the AI is outputting audio, a “response.cancel” event is sent to the OpenAI service.
  • At the start, I tell the AI to count to 20.
  • The AI starts to count, and audio playback is smooth.
  • I say “please stop counting”, and a “response.cancel” event is sent to the service.

Problem:

  • The AI does not cancel the response; it always completes it (i.e. counts to the end).

Here is part of the trace for more details:

INFO:main:Session updated: {'id': 'sess_AHaqRB3pQQwCdnY5XCHKZ', 'object': 'realtime.session', 'model': 'gpt-4o-realtime-preview', 'expires_at': 1728757471, 'modalities': ['text', 'audio'], 'instructions': 'You are a helpful assistant. Respond concisely. If user asks to tell story, tell story very shortly.', 'voice': 'alloy', 'turn_detection': None, 'input_audio_format': 'pcm16', 'output_audio_format': 'pcm16', 'input_audio_transcription': {'model': 'whisper-1'}, 'tool_choice': 'auto', 'temperature': 0.8, 'max_response_output_tokens': 'inf', 'tools': [{'name': 'get_weather', 'description': 'Get the current weather for a location.', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string'}}, 'required': ['location']}, 'type': 'function'}]}
INFO:vad:Speech started
INFO:main:Speech has started.
INFO:main:Sending audio data to the client.

… I said “Please count to 20”

INFO:main:New Conversation Item: {'id': 'item_AHaqWiaRzvSTacllhd5u8', 'object': 'realtime.item', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}
INFO:main:New Part Added: {'type': 'audio', 'transcript': ''}
INFO:main:Transcript Delta: Sure
INFO:main:Transcript Delta: ,
INFO:main:Transcript Delta: here
INFO:main:Transcript Delta: you
INFO:main:Transcript Delta: go
INFO:main:Transcript Delta: :
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: One
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: ,
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: two

… I try to stop the assistant by saying “Please stop”

INFO:vad:Speech started
INFO:main:Speech has started.
INFO:main:User started speaking while audio is playing.
INFO:main:Clearing input audio buffer.
INFO:main:Cancelling response.
INFO:main:Truncate the current audio, current item ID: item_AHaqWiaRzvSTacllhd5u8, current audio content index: 0
CRITICAL:openai_realtime_common.web_socket_manager:Sending message: {"type": "response.cancel"}
INFO:main:Sending audio data to the client.
INFO:main:Sending audio data to the client.
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Sending audio data to the client.
INFO:main:Sending audio data to the client.

INFO:main:Transcript Delta: eight

… I have said “Please stop!”

INFO:vad:Speech ended
INFO:main:Speech has ended
INFO:main:Requesting the client to generate a response.

INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Received audio delta for Response ID resp_AHaqWXORK0F9uAdKQAd9i, Item ID item_AHaqWiaRzvSTacllhd5u8, Content Index 0
INFO:main:Transcript Delta: eleven

… Assistant just continues

INFO:main:Audio done for response ID resp_AHaqWXORK0F9uAdKQAd9i, item ID item_AHaqWiaRzvSTacllhd5u8
INFO:main:Audio transcript done: 'Sure, here you go: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty.' for response ID resp_AHaqWXORK0F9uAdKQAd9i
INFO:main:Content part done: '' of type 'audio' for response ID resp_AHaqWXORK0F9uAdKQAd9i
INFO:main:Output item done for response ID resp_AHaqWXORK0F9uAdKQAd9i with content: [{'type': 'audio', 'transcript': 'Sure, here you go: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty.'}]
INFO:main:Response completed with status 'completed' and ID 'resp_AHaqWXORK0F9uAdKQAd9i'

… Later the assistant says it will stop, but this comes too late

INFO:main:New Conversation Item: {'id': 'item_AHaqdkiBS12XzLRE39wTo', 'object': 'realtime.item', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}
INFO:main:New Part Added: {'type': 'audio', 'transcript': ''}
INFO:main:Transcript Delta: Alright
INFO:main:Transcript Delta: ,
INFO:main:Transcript Delta: I’ll
INFO:main:Transcript Delta: stop
INFO:main:Transcript Delta: .

Questions:

  • The content index of the received audio deltas is always 0. Can the assistant interrupt audio within the current content index?
  • Sometimes I receive a response done event with cancelled status, so I am pretty sure the “response.cancel” event reaches the service.

I would appreciate any fix for this issue. I would like to be able to interrupt the assistant to make the conversation feel more alive.
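The trace above logs a "Truncate the current audio" step. As a hedged sketch of what client-side interruption could involve, the class below drops audio that has been received but not yet played, then builds a response.cancel event followed by a conversation.item.truncate event so the server-side item matches what the user actually heard. PlaybackBuffer, its fields, and the item ID "item_1" are hypothetical; this is an assumption about one contributing factor (audio already queued locally), not a confirmed explanation of the trace.

```python
import json
from collections import deque


class PlaybackBuffer:
    """Hypothetical local queue of audio chunks waiting to be played."""

    def __init__(self):
        self.chunks = deque()
        self.item_id = None
        self.played_ms = 0  # the playback code should advance this

    def push(self, item_id, chunk):
        """Queue one audio delta received for the given assistant item."""
        self.item_id = item_id
        self.chunks.append(chunk)

    def interrupt(self):
        """Drop unplayed audio and return the events to send to the API."""
        self.chunks.clear()
        events = [{"type": "response.cancel"}]
        if self.item_id is not None:
            events.append({
                "type": "conversation.item.truncate",
                "item_id": self.item_id,
                "content_index": 0,
                # Tell the server how much audio the user actually heard.
                "audio_end_ms": self.played_ms,
            })
        return events


buf = PlaybackBuffer()
buf.push("item_1", b"\x00\x00" * 2400)  # placeholder audio bytes
buf.played_ms = 100                      # pretend 100 ms were played
events = buf.interrupt()
messages = [json.dumps(e) for e in events]  # ready to send over the socket
```

If audio keeps playing after response.cancel has been sent, it may be worth checking whether the playback side is still draining deltas that arrived before the cancel.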

Yes, you can set the server-side VAD threshold for the API. You can also adjust the input audio levels with whatever audio library you use to compensate for a noisy environment; this can be done dynamically, based on the baseline level of your input device, which changes with your environment.

The only problem I’m facing now is the assistant interrupting and answering itself when I’m not using headphones. I’ve been trying to implement some form of Acoustic Echo Cancellation in Python, but have had no success so far. Does anybody have a solution for this?
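On the sensitivity point, here is a rough sketch of the "dynamic" approach: estimate the background level from quiet frames and only treat a frame as speech when it is well above that level. It assumes 16-bit PCM frames in the machine's native byte order (little-endian on most machines). NoiseGate, margin, alpha and initial_floor are made-up names and values for illustration, not anything from the API or a particular audio library.

```python
import array
import math


def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit PCM samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class NoiseGate:
    """Flags a frame as speech only if it is well above the noise floor."""

    def __init__(self, margin=3.0, alpha=0.05, initial_floor=100.0):
        self.margin = margin  # how far above the floor counts as speech
        self.alpha = alpha    # smoothing factor for the floor estimate
        self.floor = initial_floor

    def is_speech(self, frame: bytes) -> bool:
        level = rms(frame)
        if level > self.floor * self.margin:
            return True
        # Only quiet frames update the floor (exponential moving average).
        self.floor = (1 - self.alpha) * self.floor + self.alpha * level
        return False


def tone(amplitude, n=160):
    """Synthetic test frame: a square wave at the given amplitude."""
    return array.array("h", [amplitude, -amplitude] * (n // 2)).tobytes()


gate = NoiseGate()
fan = [gate.is_speech(tone(200)) for _ in range(50)]  # steady fan-like noise
voice = gate.is_speech(tone(5000))                    # much louder frame
```

Frames the gate rejects could be replaced with silence before being sent, or simply not be allowed to trigger an interruption. The simpler alternative is raising the threshold field under turn_detection in the session update.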

Hi and welcome to the forum!

This seems like a case where some noise reduction or removal algorithm is required.

It’s the kind of thing Dolby used to do (and perhaps still does). I’m sure a quick look on GitHub will turn up a vast number of background noise suppression algorithms, many of them real-time: the kind of thing that Discord, Zoom, Teams and Google Meet use.

I agree it would be nice to have that as an option on the endpoint, but for now at least, it seems the API requires a clean audio source as a prerequisite.
