Truncating audio will delete the server-side text transcript to ensure there is not text in the context that hasn't been heard by the user.
Should we use the audio_end_ms to know what has been “heard” by the client, and store the transcription of the audio history up until that point? Or, based on the above, should we delete all of the response text from our chat history?
Are the instructions meant to state that we should delete any server-side text transcript past the truncation point, or that we should pretend no response was ever generated? If the latter, wouldn’t this result in a chat history with two consecutive user messages (the one whose response was interrupted and therefore deleted, and the new utterance that was the actual interruption)?
The documentation for the truncation method is about the API-stored conversation state that gets reused on later turns, not about the presentation of a transcription you provide to the user. That’s the first thing I can help clarify.
We’ll call the parallel text a “transcript”, and the assistant audio that was already produced the “prior audio generation”.
Here’s the scenario:
The AI generates audio output tokens faster than real time. Your client, however, plays them back at the audio’s sample rate, so you might either relay the websocket stream at roughly that real-time rate or have a client that tracks playback position itself (the exact implementation of interruption-time detection is up to you). Meanwhile, the server-side assistant audio may be generated at perhaps 10x real time, or be completely generated by the time you’d interact with it through further API calls.
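For the timing-aware route, here’s a minimal sketch of the “heard so far” bookkeeping I mean, assuming 24 kHz mono PCM16 output and that your playback code (not the websocket receive path) reports chunks as they are actually played:

```python
SAMPLE_RATE_HZ = 24000      # assumed Realtime output format
BYTES_PER_SAMPLE = 2        # PCM16

class PlaybackClock:
    """Tracks how much assistant audio the listener has actually heard."""

    def __init__(self):
        self.bytes_played = 0

    def on_audio_played(self, chunk: bytes) -> None:
        # Call this from your playback path, not when the websocket delivers
        # the chunk -- the server streams far faster than real time.
        self.bytes_played += len(chunk)

    def heard_ms(self) -> int:
        samples = self.bytes_played // BYTES_PER_SAMPLE
        return int(samples * 1000 / SAMPLE_RATE_HZ)
```

When the user interrupts, whatever `heard_ms()` reports at that moment is the value you would pass as audio_end_ms.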
So you want the prior audio generation not to appear to the AI as though a complete answer was delivered and understood by the user. If the user says, “that’s not what I was asking about at all”, make the AI understand the abruptness. The assistant audio item in the chat history can have its end trimmed off at the point you specify with the API event conversation.item.truncate.
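As a rough sketch (assuming a synchronous websocket client `ws` you already hold, and an item_id you recorded from the response events; both names are illustrative), sending the truncate event could look like this:

```python
import json

def truncate_assistant_audio(ws, item_id: str, audio_end_ms: int) -> None:
    """Tell the server to drop everything past what the user actually heard."""
    ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,        # the assistant message item being played
        "content_index": 0,        # the audio content part of that item
        "audio_end_ms": audio_end_ms,
    }))
```

For example: `truncate_assistant_audio(ws, last_assistant_item_id, clock.heard_ms())` at the moment the interruption is detected.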
I don’t see how shortening a previously generated AI audio output would affect any other turns you did not specify. You don’t delete anything; you just clean up the record of what was actually delivered, if you allow interruptions (instead of muting the mic to avoid echo feedback).
What about the text transcription? That is a deliverable for you to optionally display. I don’t expect the server will re-run it through Whisper for free, and there isn’t a parallel “word timestamp” version of the transcript. If you have coded up an experimentation framework (I have not), you could check whether conversation.item.retrieve returns an altered transcript a while after a truncate, via the conversation.item.retrieved event.
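If you do build that experiment, the round trip could look roughly like this (sketch only; `ws` is your open websocket, and the field names reflect my understanding of the retrieve/retrieved pair, so verify them against the current API reference):

```python
import json

def request_item(ws, item_id: str) -> None:
    # Ask the server to echo back the stored item; any transcript it kept
    # should arrive later in a conversation.item.retrieved server event.
    ws.send(json.dumps({
        "type": "conversation.item.retrieve",
        "item_id": item_id,
    }))

def transcript_from_retrieved(event: dict):
    # Dig a transcript out of the retrieved item, if the server kept one.
    if event.get("type") != "conversation.item.retrieved":
        return None
    for part in event.get("item", {}).get("content", []):
        if "transcript" in part:
            return part["transcript"]
    return None
```

Whether the transcript comes back shortened, unchanged, or not at all after a truncate is exactly the thing this would tell you.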
Since you know the new audio length, and if you have captured the assistant audio, you could trim it yourself and send it through your own Whisper call. The result could read like, “The reason that you often experience (AI interrupted)”. (There isn’t a good out-of-band method for retrieving the truncated assistant audio from the server state.)
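A sketch of that re-transcription step, assuming you buffered the assistant PCM16 chunks as they arrived (24 kHz mono assumed) and are using the openai Python SDK:

```python
import io
import wave
from openai import OpenAI

client = OpenAI()
SAMPLE_RATE_HZ = 24000   # assumed Realtime output format
BYTES_PER_SAMPLE = 2     # PCM16 mono

def retranscribe_heard_audio(assistant_pcm: bytes, audio_end_ms: int) -> str:
    """Trim captured assistant PCM to what was heard, then re-transcribe it."""
    end_byte = int(audio_end_ms / 1000 * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE
    heard = assistant_pcm[:end_byte]

    # Wrap the raw PCM in a WAV container so the transcription endpoint accepts it.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(BYTES_PER_SAMPLE)
        w.setframerate(SAMPLE_RATE_HZ)
        w.writeframes(heard)
    buf.seek(0)
    buf.name = "heard.wav"   # the SDK uses the name to infer the file type

    result = client.audio.transcriptions.create(model="whisper-1", file=buf)
    return result.text
```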
Hopefully I’ve understood and answered everything you were unsure about.
Thank you. So my chat history is text-based. You described truncating the response audio based on the audio_end_ms. To make my history match, I would do this truncation, then regenerate text for the truncated audio and rewrite my last assistant message so it only contains what the client heard before they interrupted with the next utterance.
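Something like this is what I’m picturing for that last step (sketch only; `history` is just whatever structure my own text chat log uses, and the “(interrupted)” marker is one way to signal the cut-off):

```python
def rewrite_interrupted_turn(history: list, heard_text: str) -> None:
    """Replace the last assistant message with only what was actually heard."""
    for msg in reversed(history):
        if msg["role"] == "assistant":
            msg["content"] = f"{heard_text} (interrupted)"
            break
```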