Realtime API WebRTC `conversation.item.truncate` not canceling audio?

My hair is falling out over this.

I am trying to add “simple” Interrupt/Stop functionality to my Kotlin Android Phone+Wear Push-To-Talk app (GitHub - swooby/AlfredAI: OpenAI Realtime API over WebRTC Push-To-Talk Android Phone[/Mobile] + Watch[/Wear] + [Bluetooth]AudioRouting).

The flow is fairly simple:

  1. I send conversation.item.create with "tell me a story" + response.create
  2. The server responds that my item is created
  3. The server sends that its response "item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT" ...} is created
  4. The server starts sending response.audio_transcript.delta with text deltas for "item_id":"item_AxNzYCRoDo1vCDRPWmWWT"
  5. I wait 2-3 seconds and then send (a code sketch of this step follows the list):
    1. {"type":"conversation.item.truncate","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449,"event_id":"evt_636dWnY4wA5Z4EobD"}
      The documentation (https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/truncate) says “If successful, the server will respond with a conversation.item.truncated event.”
    2. {"type":"response.cancel","event_id":"evt_Y4rqqH4yVHG7gSLcW"}
      I do not specify response_id. The documentation (https://platform.openai.com/docs/api-reference/realtime-client-events/response/cancel#realtime-client-events/response/cancel-response_id) says “A specific response ID to cancel - if not provided, will cancel an in-progress response in the default conversation.”
  6. The server responds:
    1. {"type":"conversation.item.truncated",...,"item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449}
    2. {"type":"response.audio.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0}"
    3. {"type":"response.audio_transcript.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"transcript":"Once upon a time, ... with the"}
    4. "type":"response.content_part.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"part":{"type":"audio","transcript":"Once upon a time, ... with the"}}"
    5. {"type":"response.output_item.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","output_index":0,"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, ... with the"}]}}"
    6. {"type":"response.done","...,"response":{"object":"realtime.response","id":"resp_AxNzYuSUmeIn7UuMFAwz7","status":"cancelled","status_details":{"type":"cancelled","reason":"client_cancelled"},"output":[{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, ... with the"}]}],"conversation_id":"conv_AxNyZqBJYhtuLWtcSa2J9","modalities":["audio","text"],"voice":"ash","custom_voice_id":null,"output_audio_format":"pcm16","temperature":0.800000011920929,"max_output_tokens":1024,"usage":...}}"

At this point I would expect the audio to stop streaming.
But it does not.
It just keeps coming.

Then, 8 seconds later the server sends:
8. {"type":"output_audio_buffer.audio_stopped","event_id":"event_8230e7b637584b08","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7"}
NOTE that output_audio_buffer.audio_stopped is not documented anywhere at https://platform.openai.com/docs/api-reference/realtime

I have seen plenty of demos of conversation.item.truncate working properly, but they are all WebSocket based.
I have yet to find a WebRTC-based demo showing conversation.item.truncate working.

I am working on implementing my own AudioTrack player that queues incoming audio and flushes it when I see a response.audio.done server event (sketched after the list below), but:

  1. I feel like I must be doing something wrong, because I don’t see any other WebRTC implementations going to this extreme.
  2. I don’t think flushing all received audio buffers will do much good if the server really is still streaming the audio to me until output_audio_buffer.audio_stopped is received.
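
For what it's worth, that workaround has roughly this shape, as a minimal sketch (assuming 24 kHz mono PCM16, matching the pcm16 output format in the logs; Pcm16Player is my illustrative name, not AlfredAI's actual class):

import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioTrack

class Pcm16Player(sampleRate: Int = 24000) {
    private val track = AudioTrack.Builder()
        .setAudioAttributes(AudioAttributes.Builder()
            .setUsage(AudioAttributes.USAGE_MEDIA)
            .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
            .build())
        .setAudioFormat(AudioFormat.Builder()
            .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
            .setSampleRate(sampleRate)
            .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
            .build())
        .setBufferSizeInBytes(AudioTrack.getMinBufferSize(sampleRate,
            AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT) * 4)
        .setTransferMode(AudioTrack.MODE_STREAM)
        .build()
        .also { it.play() }

    // Queue a chunk of incoming PCM16 audio for playback.
    fun write(pcm: ByteArray) {
        track.write(pcm, 0, pcm.size)
    }

    // Drop any queued-but-unplayed audio, e.g. on response.audio.done after a truncate.
    fun stopAndFlush() {
        track.pause()  // flush() only discards data while paused or stopped
        track.flush()
        track.play()   // ready for the next response
    }
}

Note this only flushes locally queued audio; per concern 2 above, it does nothing about whatever the server keeps streaming until output_audio_buffer.audio_stopped.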

Does anyone have a WebRTC demo of conversation.item.truncate working correctly?

I am tempted to write a simple Hello World app to make this more convincing, but I have not gotten around to that yet.
I would obviously ask ChatGPT to help me, but no models, not even o3, have any knowledge of the OpenAI Realtime API introduced 2024/08/01. :confused:
I would want to implement it using both WebSocket and WebRTC to A/B test any behavior differences.
Even a basic JavaScript dual implementation might be enough to prove whether everything works fine there, which would mean the problem is in either my code or GitHub - webrtc-sdk/android: WebRTC pre-compiled library for android.

The full log (still cut down a bit to not be ridiculously large):

TEXT "tell me a story" SENT AT 2025-02-04 16:55:39.800

2025-02-04 16:55:39.867 dataSendText: message(169 chars TEXT)="{"type":"conversation.item.create","item":{"type":"message","role":"user","content":[{"type":"input_text","text":"tell me a story"}]},"event_id":"evt_aSqFUCkRuUPCaCCbD"}"
2025-02-04 16:55:39.871 onBufferedAmountChange(169)
2025-02-04 16:55:39.901 dataSendText: message(61 chars TEXT)="{"type":"response.create","event_id":"evt_Q1FUYdNkZRnPVcJ7u"}"
...
2025-02-04 16:55:39.937 onDataChannelText: message(280 chars TEXT)="{"type":"conversation.item.created","event_id":"event_AxNzY1OLCtlVOulMMRw82","previous_item_id":null,"item":{"id":"item_AxNzYRPJRjWLWmOf6IcOr","object":"realtime.item","type":"message","status":"completed","role":"user","content":[{"type":"input_text","text":"tell me a story"}]}}"
2025-02-04 16:55:40.039 onDataChannelText: message(431 chars TEXT)="{"type":"response.created","event_id":"event_AxNzYpogVKMb5F9UBpJLj","response":{"object":"realtime.response","id":"resp_AxNzYuSUmeIn7UuMFAwz7","status":"in_progress","status_details":null,"output":[],"conversation_id":"conv_AxNyZqBJYhtuLWtcSa2J9","modalities":["audio","text"],"voice":"ash","custom_voice_id":null,"output_audio_format":"pcm16","temperature":0.800000011920929,"max_output_tokens":1024,"usage":null,"metadata":null}}"

2025-02-04 16:55:40.546 onDataChannelText: message(229 chars TEXT)="{"type":"rate_limits.updated","event_id":"event_AxNzZTaNq8hp6eSh45QKH","rate_limits":[{"name":"requests","limit":1000,"remaining":999,"reset_seconds":86.4},{"name":"tokens","limit":40000,"remaining":38387,"reset_seconds":2.419}]}"

2025-02-04 16:55:40.554 onDataChannelText: message(278 chars TEXT)="{"type":"response.output_item.added","event_id":"event_AxNzZ8YG7KM53H0NMDBr6","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","output_index":0,"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"in_progress","role":"assistant","content":[]}}"
...
2025-02-04 16:55:40.572 onDataChannelText: message(265 chars TEXT)="{"type":"conversation.item.created","event_id":"event_AxNzZidaeBt1amiQTpzVt","previous_item_id":"item_AxNzYRPJRjWLWmOf6IcOr","item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"in_progress","role":"assistant","content":[]}}"
2025-02-04 16:55:40.578 onDataChannelText: message(236 chars TEXT)="{"type":"response.content_part.added","event_id":"event_AxNzZDdy4NgjDYzHTTuXj","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"part":{"type":"audio","transcript":""}}"
2025-02-04 16:55:40.587 onDataChannelText: message(215 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZiUCJ9MfytROpvXto","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":"Once"}"
2025-02-04 16:55:40.603 onDataChannelText: message(216 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZhgfJ8PMUvr8JaAWg","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" upon"}"
2025-02-04 16:55:40.606 onDataChannelText: message(213 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZNPg7WroDe2qfQkO0","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" a"}"
2025-02-04 16:55:40.619 onDataChannelText: message(216 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZsupk1CTDWJcQUwvR","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" time"}"
...
2025-02-04 16:55:42.920 onDataChannelText: message(216 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzbp9LIkCGEiJKG04WC","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" with"}"
2025-02-04 16:55:42.935 onDataChannelText: message(215 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzbgK8L6hw60wTgf4wI","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" the"}"

STOP (aka: `conversation.item.truncate` + `response.cancel`) PRESSED AT 2025-02-04 16:55:43

2025-02-04 16:55:43.055 dataSendText: message(149 chars TEXT)="{"type":"conversation.item.truncate","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449,"event_id":"evt_636dWnY4wA5Z4EobD"}"
2025-02-04 16:55:43.056 onBufferedAmountChange(149)
2025-02-04 16:55:43.062 dataSendText: message(61 chars TEXT)="{"type":"response.cancel","event_id":"evt_Y4rqqH4yVHG7gSLcW"}"
2025-02-04 16:55:43.065 onBufferedAmountChange(61)
2025-02-04 16:55:43.134 onDataChannelText: message(156 chars TEXT)="{"type":"conversation.item.truncated","event_id":"event_AxNzcdNie3TV8osKNjN4Q","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449}"
2025-02-04 16:55:43.145 onDataChannelText: message(188 chars TEXT)="{"type":"response.audio.done","event_id":"event_AxNzcJS2Ny42isx9esvcE","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0}"
2025-02-04 16:55:43.162 onDataChannelText: message(439 chars TEXT)="{"type":"response.audio_transcript.done","event_id":"event_AxNzct2ea65VRtKlZx74h","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}"
2025-02-04 16:55:43.166 onDataChannelText: message(459 chars TEXT)="{"type":"response.content_part.done","event_id":"event_AxNzcGm3PL9tDYI5cVmP5","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"part":{"type":"audio","transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}}"
2025-02-04 16:55:43.176 onDataChannelText: message(532 chars TEXT)="{"type":"response.output_item.done","event_id":"event_AxNzcIeHQsUHj5HuSvfuU","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","output_index":0,"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}]}}"
2025-02-04 16:55:43.182 onDataChannelText: message(1109 chars TEXT)="{"type":"response.done","event_id":"event_AxNzcdY0P0jKuGyAi1PYF","response":{"object":"realtime.response","id":"resp_AxNzYuSUmeIn7UuMFAwz7","status":"cancelled","status_details":{"type":"cancelled","reason":"client_cancelled"},"output":[{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}]}],"conversation_id":"conv_AxNyZqBJYhtuLWtcSa2J9","modalities":["audio","text"],"voice":"ash","custom_voice_id":null,"output_audio_format":"pcm16","temperature":0.800000011920929,"max_output_tokens":1024,"usage":{"total_tokens":488,"input_tokens":183,"output_tokens":305,"input_token_details":{"text_tokens":183,"audio_tokens":0,"cached_tokens":0,"cached_tokens_details":{"text_tokens":0,"audio_tokens":0}},"output_token_details":{"text_tokens":68,"audio_tokens":237}},"metadata":null}}"

... 8 SECONDS PASS!!

2025-02-04 16:55:51.188 onDataChannelText: message(123 chars TEXT)="{"type":"output_audio_buffer.audio_stopped","event_id":"event_8230e7b637584b08","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7"}"
2025-02-04 16:55:51.189 onDataChannelText: undocumented `output_audio_buffer.audio_stopped`

kthnxbye

I encountered this today as well. A temp fix I had was to mute the WebRTC audio line. FYI, input_audio_buffer.speech_started clears the output buffer, so when the user begins talking again you can just unmute the output line when you get the input_audio_buffer.committed event.


What language are you using? How do you mute the WebRTC line? Just the audio track?

I am using Kotlin and will look into disabling the AudioTrack(s) that I collect in my PeerConnection observer's onAddTrack; see:

override fun onAddTrack(receiver: RtpReceiver, mediaStreams: Array<MediaStream>) {
    val track = receiver.track()
    val trackKind = track?.kind()
    if (trackKind == AudioTrack.AUDIO_TRACK_KIND) {
        // Remember each remote (server) audio track so it can be muted/unmuted later
        remoteAudioTracks.add(track as AudioTrack)
    }
}

override fun setLocalAudioTrackSpeakerEnabled(enabled: Boolean) {
    // Disabling the track mutes playback, but does not stop the server
    // from streaming audio to us
    remoteAudioTracks.forEach {
        it.setEnabled(enabled)
    }
}

Nope!

My speaker enable/disable code is:

    private var isSpeakerEnabled: Boolean = true

    override fun setLocalAudioTrackSpeakerEnabled(enabled: Boolean) {
        log.w("setLocalAudioTrackSpeakerEnabled($enabled)")
        if (isSpeakerEnabled != enabled) {
            isSpeakerEnabled = enabled
            remoteAudioTracks.forEach {
                it.setEnabled(enabled)
            }
        }
    }

I disable it by default.
I only enable it on:

  • output_audio_buffer.audio_started
  • response.audio_transcript.delta
  • response.content_part.added

I only disable it on:

  • conversation.item.truncated
  • output_audio_buffer.audio_stopped
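
Put together, the gating is roughly this (a sketch; handleServerEventType is a hypothetical dispatcher, not AlfredAI's actual code):

fun handleServerEventType(type: String) {
    when (type) {
        // Audio is starting (or already audible): unmute the remote track(s)
        "output_audio_buffer.audio_started",
        "response.audio_transcript.delta",
        "response.content_part.added" ->
            setLocalAudioTrackSpeakerEnabled(true)
        // We truncated, or the server finished draining its output buffer: mute again
        "conversation.item.truncated",
        "output_audio_buffer.audio_stopped" ->
            setLocalAudioTrackSpeakerEnabled(false)
    }
}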

Sending the truncate and response.cancel requests does get server responses saying it stopped, but, as expectedly unexpected, the audio keeps streaming; and if, while still receiving that audio, I ask the server to talk again (via a text or audio request), I get a continuation of the old audio, even though the server told me it successfully truncated the old item!!

As mentioned elsewhere out there on the Internet, the AudioTrack and event signaling are clearly disconnected, but strange things still seem afoot at the Circle K here. :confused:

Ah, that’s funny, I’m also on mobile, but in Swift. My WebRTC client implementation is likely slightly different, but here’s the relevant code:

func muteRemoteAudio() {
    peerConnection.transceivers
        .compactMap { $0.receiver.track as? RTCAudioTrack }
        .forEach { $0.isEnabled = false }
}

func unmuteRemoteAudio() {
    peerConnection.transceivers
        .compactMap { $0.receiver.track as? RTCAudioTrack }
        .forEach { $0.isEnabled = true }
}

Where peerConnection was initialized as:

let config = RTCConfiguration()
config.iceServers = [RTCIceServer(urlStrings: iceServers)]
config.sdpSemantics = .unifiedPlan
config.continualGatheringPolicy = .gatherContinually

let constraints = RTCMediaConstraints(
    mandatoryConstraints: nil,
    optionalConstraints: ["DtlsSrtpKeyAgreement": kRTCMediaConstraintsValueTrue])

guard let peerConnection = WebRTCClient.factory.peerConnection(with: config, constraints: constraints, delegate: nil) else {
    fatalError("Could not create new RTCPeerConnection")
}
self.peerConnection = peerConnection