My hair is falling out over this.
I am trying to add “simple” Interrupt/Stop functionality to my Kotlin Android Phone+Wear PushToTalk app (GitHub - swooby/AlfredAI: OpenAI Realtime API over WebRTC Push-To-Talk Android Phone[/Mobile] + Watch[/Wear] + [Bluetooth]AudioRouting).
Fairly simply:
- I send `conversation.item.create` with "tell me a story" + `response.create`
- The server responds that my item is created.
- The server reports that its response item `"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT" ...}` is created.
- The server starts sending `response.audio_transcript.delta` with text deltas for `"item_id":"item_AxNzYCRoDo1vCDRPWmWWT"`.
- I wait 2-3 seconds and then send:
{"type":"conversation.item.truncate","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449,"event_id":"evt_636dWnY4wA5Z4EobD"}
The documentation (https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/truncate) says “If successful, the server will respond with a `conversation.item.truncated` event.”
{"type":"response.cancel","event_id":"evt_Y4rqqH4yVHG7gSLcW"}
I do not specify `response_id`. The documentation (https://platform.openai.com/docs/api-reference/realtime-client-events/response/cancel#realtime-client-events/response/cancel-response_id) says “A specific response ID to cancel - if not provided, will cancel an in-progress response in the default conversation.”
- The server responds:
{"type":"conversation.item.truncated",...,"item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449}
{"type":"response.audio.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0}
{"type":"response.audio_transcript.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"transcript":"Once upon a time, ... with the"}
{"type":"response.content_part.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"part":{"type":"audio","transcript":"Once upon a time, ... with the"}}
{"type":"response.output_item.done",...,"response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","output_index":0,"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, ... with the"}]}}
{"type":"response.done",...,"response":{"object":"realtime.response","id":"resp_AxNzYuSUmeIn7UuMFAwz7","status":"cancelled","status_details":{"type":"cancelled","reason":"client_cancelled"},"output":[{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, ... with the"}]}],"conversation_id":"conv_AxNyZqBJYhtuLWtcSa2J9","modalities":["audio","text"],"voice":"ash","custom_voice_id":null,"output_audio_format":"pcm16","temperature":0.800000011920929,"max_output_tokens":1024,"usage":...}}
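For reference, the client events in the steps above can be produced with a few string helpers. This is a minimal, dependency-free Kotlin sketch that assumes hand-built JSON; the helper names (`jsonString`, `obj`, `newEventId`, and the four event builders) are hypothetical, not taken from AlfredAI or any SDK, and a real app would more likely use org.json or kotlinx.serialization. With the event IDs from the log passed in, it produces exactly the payloads shown there.

```kotlin
import java.util.UUID

// Minimal JSON string escaping (quotes, backslashes, control characters).
fun jsonString(s: String): String {
    val sb = StringBuilder("\"")
    for (c in s) {
        when {
            c == '"' -> sb.append("\\\"")
            c == '\\' -> sb.append("\\\\")
            c == '\n' -> sb.append("\\n")
            c == '\r' -> sb.append("\\r")
            c == '\t' -> sb.append("\\t")
            c < ' ' -> sb.append(String.format("\\u%04x", c.code))
            else -> sb.append(c)
        }
    }
    return sb.append('"').toString()
}

// Tiny JSON object builder; values must already be valid JSON (use jsonString() for text).
fun obj(vararg fields: Pair<String, String>): String =
    fields.joinToString(",", "{", "}") { (k, v) -> "${jsonString(k)}:$v" }

// Hypothetical client event_id generator; only the "evt_" prefix mirrors the log.
fun newEventId(): String = "evt_" + UUID.randomUUID().toString().replace("-", "").take(17)

fun conversationItemCreate(text: String, eventId: String = newEventId()): String = obj(
    "type" to jsonString("conversation.item.create"),
    "item" to obj(
        "type" to jsonString("message"),
        "role" to jsonString("user"),
        "content" to "[" + obj("type" to jsonString("input_text"), "text" to jsonString(text)) + "]",
    ),
    "event_id" to jsonString(eventId),
)

fun responseCreate(eventId: String = newEventId()): String =
    obj("type" to jsonString("response.create"), "event_id" to jsonString(eventId))

// audio_end_ms is meant to be how much of the assistant audio was actually played.
fun conversationItemTruncate(
    itemId: String,
    audioEndMs: Long,
    contentIndex: Int = 0,
    eventId: String = newEventId(),
): String = obj(
    "type" to jsonString("conversation.item.truncate"),
    "item_id" to jsonString(itemId),
    "content_index" to contentIndex.toString(),
    "audio_end_ms" to audioEndMs.toString(),
    "event_id" to jsonString(eventId),
)

// response_id is optional; without it the docs say an in-progress response
// in the default conversation is cancelled.
fun responseCancel(responseId: String? = null, eventId: String = newEventId()): String {
    val fields = mutableListOf("type" to jsonString("response.cancel"))
    if (responseId != null) fields += "response_id" to jsonString(responseId)
    fields += "event_id" to jsonString(eventId)
    return obj(*fields.toTypedArray())
}
```

For example, `conversationItemTruncate("item_AxNzYCRoDo1vCDRPWmWWT", 2449)` followed by `responseCancel()` reproduces the STOP sequence. Passing the response id from `response.created` to `responseCancel` would make the cancel target explicit instead of relying on the default-conversation behavior.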
At this point (after `response.done` with `"status":"cancelled"`), I would expect the audio to stop streaming. But it does not. It just keeps coming.
Then, 8 seconds later, the server sends:
{"type":"output_audio_buffer.audio_stopped","event_id":"event_8230e7b637584b08","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7"}
NOTE that `output_audio_buffer.audio_stopped` is not documented anywhere at https://platform.openai.com/docs/api-reference/realtime
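The 8-second gap between `response.done` and `output_audio_buffer.audio_stopped` looks consistent with the server having generated the audio faster than real time and continuing to pace the already-generated audio out over the WebRTC track after the cancel; that is an interpretation of the timings, not documented behavior. A client can at least make the distinction explicit with a small tracker like this Kotlin sketch. `AudioPhase` and `ResponseAudioTracker` are hypothetical names, and extracting `type` with a regex assumes (as holds for every server message in this log) that `type` is the first key; a real JSON parser would be safer.

```kotlin
// Coarse phases of one assistant audio response, as seen from the data channel.
enum class AudioPhase { IDLE, GENERATING, SERVER_DONE_AUDIO_STILL_PLAYING }

class ResponseAudioTracker {
    var phase: AudioPhase = AudioPhase.IDLE
        private set
    val unknownTypes = mutableListOf<String>()

    fun onServerEvent(json: String) {
        val type = TYPE_REGEX.find(json)?.groupValues?.get(1) ?: return
        when (type) {
            "response.created" -> phase = AudioPhase.GENERATING
            // Generation finished (or was cancelled); audio may still arrive on the media track.
            "response.done" ->
                if (phase == AudioPhase.GENERATING) phase = AudioPhase.SERVER_DONE_AUDIO_STILL_PLAYING
            // Undocumented at the time of this post: the server stopped sending audio.
            "output_audio_buffer.audio_stopped" -> phase = AudioPhase.IDLE
            in KNOWN_TYPES -> Unit
            else -> unknownTypes += type // record new event types instead of failing on them
        }
    }

    companion object {
        // Assumes "type" is the first key, as in every server message in the log.
        private val TYPE_REGEX = Regex("\"type\"\\s*:\\s*\"([^\"]+)\"")
        private val KNOWN_TYPES = setOf(
            "conversation.item.created", "conversation.item.truncated", "rate_limits.updated",
            "response.output_item.added", "response.output_item.done",
            "response.content_part.added", "response.content_part.done",
            "response.audio_transcript.delta", "response.audio_transcript.done", "response.audio.done",
        )
    }
}
```

Driving this from the data-channel callback makes it easy to show a "stopping..." state in the UI while the phase is `SERVER_DONE_AUDIO_STILL_PLAYING`.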
I have seen plenty of demos of `conversation.item.truncated` working properly, but they are all WebSocket-based. I have yet to find a WebRTC-based demo of `conversation.item.truncated` showing it working.
I am working on a workaround that uses my own AudioTrack player, queuing incoming audio and flushing it when I see a `response.audio.done` server event, but:
- I feel like I must be doing something wrong, because I don’t see any other WebRTC implementations going to this extreme.
- I don’t think flushing all received audio buffers will do much good if the server really is still streaming the audio to me until `output_audio_buffer.audio_stopped` is received.
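One possible shape for that queue-and-flush player is sketched below; it is not AlfredAI's actual code. `PcmSink` is a hypothetical stand-in for `android.media.AudioTrack.write`, and the 24 kHz mono rate is an assumption about the `pcm16` output format. Rather than flushing once on `response.audio.done`, it flushes on STOP and then keeps dropping incoming chunks until told to resume (for example on `output_audio_buffer.audio_stopped`), which covers the case where the server is still streaming after the cancel. `playedMs()` counts samples handed to the sink, one candidate source for `audio_end_ms`.

```kotlin
// Hypothetical stand-in for android.media.AudioTrack.write(...).
fun interface PcmSink {
    fun write(samples: ShortArray)
}

class QueuedPcmPlayer(
    private val sink: PcmSink,
    private val sampleRateHz: Int = 24_000, // assumption: pcm16 output is 24 kHz mono
) {
    private val lock = Any()
    private val queue = ArrayDeque<ShortArray>()
    private var samplesWritten = 0L
    private var dropping = false

    // Called for each incoming PCM16 chunk; ignored while a STOP is in effect.
    fun enqueue(chunk: ShortArray) {
        synchronized(lock) {
            if (dropping) return
            queue.addLast(chunk)
        }
    }

    // Called repeatedly from a playback thread; returns false when nothing was played.
    fun pumpOne(): Boolean {
        val chunk = synchronized(lock) { queue.removeFirstOrNull() } ?: return false
        sink.write(chunk) // may block (as AudioTrack.write does), so it runs outside the lock
        synchronized(lock) { samplesWritten += chunk.size }
        return true
    }

    // Milliseconds of audio written to the sink so far.
    fun playedMs(): Long = synchronized(lock) { samplesWritten * 1000 / sampleRateHz }

    // STOP pressed: discard queued audio and drop anything that still arrives.
    fun stop() {
        synchronized(lock) {
            queue.clear()
            dropping = true
        }
    }

    // Accept audio again, e.g. after output_audio_buffer.audio_stopped.
    fun resume() {
        synchronized(lock) { dropping = false }
    }
}
```

Samples written to an `AudioTrack` have not necessarily been heard yet, so `AudioTrack.getPlaybackHeadPosition()` would probably give a more accurate `audio_end_ms` than counting writes.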
Does anyone have a WebRTC demo of `conversation.item.truncate` working correctly?
I am tempted to write a simple Hello World app to make this more convincing, but I have not gotten around to that yet.
I would obviously ask ChatGPT to help me, but no models, not even o3, have any knowledge of the OpenAI Realtime API introduced 2024/10/01.
I would want to implement it using both WebSocket and WebRTC to A/B test any behavior differences.
Even a basic JavaScript implementation of both transports might be enough to prove whether everything works correctly on the server side, in which case the problem must be in either my code or webrtc-sdk/android (GitHub - webrtc-sdk/android: WebRTC pre-compiled library for android).
The full log (still cut down a bit to not be ridiculously large):
TEXT "tell me a story" SENT AT 2025-02-04 16:55:39.800
2025-02-04 16:55:39.867 dataSendText: message(169 chars TEXT)="{"type":"conversation.item.create","item":{"type":"message","role":"user","content":[{"type":"input_text","text":"tell me a story"}]},"event_id":"evt_aSqFUCkRuUPCaCCbD"}"
2025-02-04 16:55:39.871 onBufferedAmountChange(169)
2025-02-04 16:55:39.901 dataSendText: message(61 chars TEXT)="{"type":"response.create","event_id":"evt_Q1FUYdNkZRnPVcJ7u"}"
...
2025-02-04 16:55:39.937 onDataChannelText: message(280 chars TEXT)="{"type":"conversation.item.created","event_id":"event_AxNzY1OLCtlVOulMMRw82","previous_item_id":null,"item":{"id":"item_AxNzYRPJRjWLWmOf6IcOr","object":"realtime.item","type":"message","status":"completed","role":"user","content":[{"type":"input_text","text":"tell me a story"}]}}"
2025-02-04 16:55:40.039 onDataChannelText: message(431 chars TEXT)="{"type":"response.created","event_id":"event_AxNzYpogVKMb5F9UBpJLj","response":{"object":"realtime.response","id":"resp_AxNzYuSUmeIn7UuMFAwz7","status":"in_progress","status_details":null,"output":[],"conversation_id":"conv_AxNyZqBJYhtuLWtcSa2J9","modalities":["audio","text"],"voice":"ash","custom_voice_id":null,"output_audio_format":"pcm16","temperature":0.800000011920929,"max_output_tokens":1024,"usage":null,"metadata":null}}"
2025-02-04 16:55:40.546 onDataChannelText: message(229 chars TEXT)="{"type":"rate_limits.updated","event_id":"event_AxNzZTaNq8hp6eSh45QKH","rate_limits":[{"name":"requests","limit":1000,"remaining":999,"reset_seconds":86.4},{"name":"tokens","limit":40000,"remaining":38387,"reset_seconds":2.419}]}"
2025-02-04 16:55:40.554 onDataChannelText: message(278 chars TEXT)="{"type":"response.output_item.added","event_id":"event_AxNzZ8YG7KM53H0NMDBr6","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","output_index":0,"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"in_progress","role":"assistant","content":[]}}"
...
2025-02-04 16:55:40.572 onDataChannelText: message(265 chars TEXT)="{"type":"conversation.item.created","event_id":"event_AxNzZidaeBt1amiQTpzVt","previous_item_id":"item_AxNzYRPJRjWLWmOf6IcOr","item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"in_progress","role":"assistant","content":[]}}"
2025-02-04 16:55:40.578 onDataChannelText: message(236 chars TEXT)="{"type":"response.content_part.added","event_id":"event_AxNzZDdy4NgjDYzHTTuXj","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"part":{"type":"audio","transcript":""}}"
2025-02-04 16:55:40.587 onDataChannelText: message(215 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZiUCJ9MfytROpvXto","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":"Once"}"
2025-02-04 16:55:40.603 onDataChannelText: message(216 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZhgfJ8PMUvr8JaAWg","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" upon"}"
2025-02-04 16:55:40.606 onDataChannelText: message(213 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZNPg7WroDe2qfQkO0","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" a"}"
2025-02-04 16:55:40.619 onDataChannelText: message(216 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzZsupk1CTDWJcQUwvR","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" time"}"
...
2025-02-04 16:55:42.920 onDataChannelText: message(216 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzbp9LIkCGEiJKG04WC","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" with"}"
2025-02-04 16:55:42.935 onDataChannelText: message(215 chars TEXT)="{"type":"response.audio_transcript.delta","event_id":"event_AxNzbgK8L6hw60wTgf4wI","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"delta":" the"}"
STOP (aka: `conversation.item.truncate` + `response.cancel`) PRESSED AT 2025-02-04 16:55:43
2025-02-04 16:55:43.055 dataSendText: message(149 chars TEXT)="{"type":"conversation.item.truncate","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449,"event_id":"evt_636dWnY4wA5Z4EobD"}"
2025-02-04 16:55:43.056 onBufferedAmountChange(149)
2025-02-04 16:55:43.062 dataSendText: message(61 chars TEXT)="{"type":"response.cancel","event_id":"evt_Y4rqqH4yVHG7gSLcW"}"
2025-02-04 16:55:43.065 onBufferedAmountChange(61)
2025-02-04 16:55:43.134 onDataChannelText: message(156 chars TEXT)="{"type":"conversation.item.truncated","event_id":"event_AxNzcdNie3TV8osKNjN4Q","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","content_index":0,"audio_end_ms":2449}"
2025-02-04 16:55:43.145 onDataChannelText: message(188 chars TEXT)="{"type":"response.audio.done","event_id":"event_AxNzcJS2Ny42isx9esvcE","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0}"
2025-02-04 16:55:43.162 onDataChannelText: message(439 chars TEXT)="{"type":"response.audio_transcript.done","event_id":"event_AxNzct2ea65VRtKlZx74h","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}"
2025-02-04 16:55:43.166 onDataChannelText: message(459 chars TEXT)="{"type":"response.content_part.done","event_id":"event_AxNzcGm3PL9tDYI5cVmP5","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","item_id":"item_AxNzYCRoDo1vCDRPWmWWT","output_index":0,"content_index":0,"part":{"type":"audio","transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}}"
2025-02-04 16:55:43.176 onDataChannelText: message(532 chars TEXT)="{"type":"response.output_item.done","event_id":"event_AxNzcIeHQsUHj5HuSvfuU","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7","output_index":0,"item":{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}]}}"
2025-02-04 16:55:43.182 onDataChannelText: message(1109 chars TEXT)="{"type":"response.done","event_id":"event_AxNzcdY0P0jKuGyAi1PYF","response":{"object":"realtime.response","id":"resp_AxNzYuSUmeIn7UuMFAwz7","status":"cancelled","status_details":{"type":"cancelled","reason":"client_cancelled"},"output":[{"id":"item_AxNzYCRoDo1vCDRPWmWWT","object":"realtime.item","type":"message","status":"incomplete","role":"assistant","content":[{"type":"audio","transcript":"Once upon a time, in a land where the mountains touched the sky and the rivers sang with the voice of the earth, there lived a young wanderer named Elara. Elara had a heart full of curiosity and a spirit that burned with the"}]}],"conversation_id":"conv_AxNyZqBJYhtuLWtcSa2J9","modalities":["audio","text"],"voice":"ash","custom_voice_id":null,"output_audio_format":"pcm16","temperature":0.800000011920929,"max_output_tokens":1024,"usage":{"total_tokens":488,"input_tokens":183,"output_tokens":305,"input_token_details":{"text_tokens":183,"audio_tokens":0,"cached_tokens":0,"cached_tokens_details":{"text_tokens":0,"audio_tokens":0}},"output_token_details":{"text_tokens":68,"audio_tokens":237}},"metadata":null}}"
... 8 SECONDS PASS!!
2025-02-04 16:55:51.188 onDataChannelText: message(123 chars TEXT)="{"type":"output_audio_buffer.audio_stopped","event_id":"event_8230e7b637584b08","response_id":"resp_AxNzYuSUmeIn7UuMFAwz7"}"
2025-02-04 16:55:51.189 onDataChannelText: undocumented `output_audio_buffer.audio_stopped`
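To compare runs (for example the WebSocket vs WebRTC A/B test mentioned above), it helps to measure the delay mechanically instead of eyeballing timestamps. This Kotlin sketch assumes the exact `onDataChannelText: message(N chars TEXT)="{"type":...` line format used in this log and reads only the first `type` key, so it would need adjusting for other log formats; `parseServerEvents`, `gapMs`, and `LoggedEvent` are hypothetical helpers.

```kotlin
import java.time.Duration
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Matches lines such as:
// 2025-02-04 16:55:43.182 onDataChannelText: message(1109 chars TEXT)="{"type":"response.done",...
val SERVER_EVENT_LINE = Regex(
    "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) onDataChannelText: " +
        "message\\(\\d+ chars TEXT\\)=\"\\{\"type\":\"([^\"]+)\""
)
val LOG_TIME: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

data class LoggedEvent(val at: LocalDateTime, val type: String)

// Extracts (timestamp, event type) for every received data-channel message; other lines are skipped.
fun parseServerEvents(log: String): List<LoggedEvent> =
    log.lineSequence().mapNotNull { line ->
        SERVER_EVENT_LINE.find(line)?.destructured?.let { (ts, type) ->
            LoggedEvent(LocalDateTime.parse(ts, LOG_TIME), type)
        }
    }.toList()

// Milliseconds from the first `from` event to the first `to` event at or after it, or null.
fun gapMs(events: List<LoggedEvent>, from: String, to: String): Long? {
    val start = events.firstOrNull { it.type == from } ?: return null
    val end = events.firstOrNull { it.type == to && !it.at.isBefore(start.at) } ?: return null
    return Duration.between(start.at, end.at).toMillis()
}
```

Fed this log, `gapMs(events, "response.done", "output_audio_buffer.audio_stopped")` should return 8006 (16:55:43.182 to 16:55:51.188).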
kthnxbye