In the realtime API, when the user interrupts the AI midway through playback of a response, the client sends response.cancel and conversation.item.truncate. The documentation says that the effect of this will be to “synchronize the server’s understanding of the audio with the client’s playback”, and that it will “delete the server-side text transcript to ensure there is not text in the context that hasn’t been heard by the user.”
Therefore, when the server responds to the truncation with a response.audio_transcript.done, my expectation is that the new transcript will contain only the text corresponding to the audio the user actually heard before interrupting. I’m not expecting perfect alignment, just something in the rough vicinity.
What I actually get is a transcript containing a full sentence or two that the user never heard, often 5-10 seconds’ worth. This is reproducible in the openai-realtime-console demo project that OpenAI provides on GitHub, with no changes. Simply ask for something that will generate a long response from the AI, e.g. “Give me your thoughts at length on the role of democracy in Ancient Greece”, interrupt it almost immediately, and check the contents of the response.audio_transcript.done that follows the truncate event.
The net effect of this is that, after an interruption, the AI believes the user heard it say more than was actually played, and so the next thing it says to the user can seem disjointed.
I’ve experimented with fixing this in a janky way by somewhat decreasing the value of the truncate event’s audio_end_ms parameter, but it has little effect.
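For concreteness, this is roughly what my interruption handler does. The `handleInterruption` name and the playback bookkeeping (`currentItemId`, `samplesPlayed`) are my own; the two event payloads follow the API reference, and the commented-out line is the janky margin experiment mentioned above:

```ts
// Rough sketch of my interruption handler. `currentItemId` and `samplesPlayed`
// come from my own client-side bookkeeping; the event payloads follow the
// API reference for response.cancel and conversation.item.truncate.

const OUTPUT_SAMPLE_RATE = 24_000; // default PCM16 output rate

function handleInterruption(
  ws: WebSocket,
  currentItemId: string,
  samplesPlayed: number, // samples actually played back, not samples received
) {
  let audioEndMs = Math.floor((samplesPlayed / OUTPUT_SAMPLE_RATE) * 1000);

  // The janky experiment mentioned above: shaving time off the reported
  // playback position. It barely changes the transcript that comes back.
  // audioEndMs = Math.max(0, audioEndMs - 500);

  // Stop the in-flight response.
  ws.send(JSON.stringify({ type: "response.cancel" }));

  // Tell the server where playback actually stopped, so it can (per the docs)
  // drop the audio and transcript text the user never heard.
  ws.send(
    JSON.stringify({
      type: "conversation.item.truncate",
      item_id: currentItemId,
      content_index: 0,
      audio_end_ms: audioEndMs,
    }),
  );
}
```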
Remember that the model can generate and send you audio much faster than the rate at which the decoded audio plays back. While you are still listening, generation is already done.
That makes your event and its timing somewhat meaningless on their own. Timings and latency would have to be maintained and calculated to work out where in the audio stream the interruption actually landed when you signal a new create, or you would have to send back an “input truncation sample” of decoded audio to a system where all of the context is server-side, and where you already don’t have the kind of management you’d want over something as basic as its length.
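To even estimate where playback actually was when the user interrupted, the client has to keep its own clock of audio received versus audio played, something along these lines (a rough sketch only, assuming 24 kHz PCM16 deltas and a Web Audio player; the details depend entirely on your audio stack):

```ts
// Rough sketch of the bookkeeping a client would need. Assumes 24 kHz PCM16
// audio deltas played through a Web Audio AudioContext; adapt to your stack.

const SAMPLE_RATE = 24_000;

class PlaybackClock {
  private samplesReceived = 0;                      // everything the server streamed down
  private playbackStartedAt: number | null = null;  // AudioContext time, in seconds

  constructor(private ctx: AudioContext) {}

  // Call this for every response.audio.delta, with the decoded byte length.
  onAudioDelta(pcm16ByteLength: number) {
    this.samplesReceived += pcm16ByteLength / 2; // 2 bytes per PCM16 sample
    this.playbackStartedAt ??= this.ctx.currentTime;
  }

  // Milliseconds of audio the listener has actually heard so far.
  playedMs(): number {
    if (this.playbackStartedAt === null) return 0;
    const elapsedMs = (this.ctx.currentTime - this.playbackStartedAt) * 1000;
    const receivedMs = (this.samplesReceived / SAMPLE_RATE) * 1000;
    // Generation finishes long before playback does, so wall-clock time is
    // usually the limiting factor, which is exactly the point above.
    return Math.min(elapsedMs, receivedMs);
  }
}
```

Even then it is only an estimate; network jitter and buffering in the output device add more slack on top.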
The second point is that the AI doesn’t use the transcript as its continuing input; the assistant chat history carries voice audio as its context. We see on Chat Completions that placing text into the assistant chat history makes the model stop responding in voice, due to in-context learning. A voice-only modality is probably not offered because OpenAI themselves have not found a way to control what the AI produces in response to various inputs. It will speak once in reply to text when no assistant message is present, then no more (likely because they prime the AI for the voice you have selected with undisclosed context). “You only respond in German” has no analog in “you only speak”.
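For example, this is the shape of Chat Completions request where that fallback shows up. The model, modalities, and audio fields are as documented for the audio preview; the message contents are only an illustration:

```ts
import OpenAI from "openai";

// Illustration of the behavior described above, not a recommendation:
// with a plain-text assistant turn sitting in the history, the audio-preview
// model tends to come back in text only, even though audio was requested.
// The message contents are made up for the example.
async function main() {
  const client = new OpenAI();

  const completion = await client.chat.completions.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "wav" },
    messages: [
      { role: "user", content: "Tell me a fact about Athens." },
      // Text placed in the assistant role: the in-context signal that
      // appears to switch the model out of voice on later turns.
      { role: "assistant", content: "Athens is named after the goddess Athena." },
      { role: "user", content: "Tell me another one." },
    ],
  });

  // When the fallback happens, `audio` is absent and only text comes back.
  const message = completion.choices[0].message;
  console.log(message.audio ?? message.content);
}

main();
```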
Audio is like a different language to the AI, and OpenAI won’t let you control placing it in the assistant role, or draw your own delineation between speakers, lest you use that to make it speak differently. That much is apparent.
The internal units of audio are tokens. They come from an undisclosed convolutional codec, and I have yet to verify that it is even symmetric, given the obfuscation of only being able to supply a past audio response ID as a past assistant response, and billing that could cover 10x the reported token usage.
The output token stream is what would need to be truncated, long after it was produced and done, and even that may take more than arbitrarily cutting off the fed-back tokens, since they may already be packetized or carry semantics that span tokens. You don’t get to play the AI’s token numbers through your own decoder.
So you can see you have a legitimate concern, but the API currently exposes nothing that would allow a remedy.
Your first two paragraphs seem to be in opposition to the stated capabilities in the documentation. It’s obviously true that the model generates audio and text faster than real time. But look at the API’s description of the conversation.item.truncate operation: it is explicitly said to synchronize the server’s understanding of the audio with the client’s playback. Its most relevant parameter, audio_end_ms, makes no sense unless the API has some ability to understand the timing of audio playback, when given that information explicitly, and react accordingly. If I’m misunderstanding this, then what is the correct interpretation of what this event is and what it should do?
Your claims about the transcript make sense, but they don’t hold up in my testing. For example, I gave the AI a prompt in the playground that says something like, “You must tell the user these three important facts. If interrupted, continue your train of thought until all three facts are communicated. No repeating or recapping.”
In my testing, the audio will sound like the following:
AI: "I need to tell you three important facts. First, "
Me: “hello”
AI: “As I was saying, <fact 3>.”
Whereas the transcript looks like the following:
AI: “I need to tell you three important facts. First, <fact 1>. Second, <fact 2>. And third,”
Me: “hello”
AI: “As I was saying, <fact 3>.”
This strongly suggests that whatever context the AI is using looks a lot like what’s in the transcript. So even if the audio generation isn’t consuming the text transcript directly, it is generating from a context that has the same problem the transcript does: it isn’t properly truncated when the user interrupts.
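For reference, this is how I’m capturing the transcript side of that comparison. It’s logging only; `ws` is the Realtime WebSocket, and `lastTruncateMs` is whatever audio_end_ms I most recently sent in the truncate event:

```ts
// Logging only: capture what the server reports after a truncation.
// `ws` is the Realtime WebSocket; `lastTruncateMs` is assigned wherever the
// conversation.item.truncate event is sent. Event fields are as documented.

declare const ws: WebSocket;

let lastTruncateMs: number | null = null;

ws.addEventListener("message", (msg: MessageEvent) => {
  const event = JSON.parse(msg.data as string);

  switch (event.type) {
    case "conversation.item.truncated":
      // Server acknowledges the truncation point we asked for.
      console.log("server truncated at", event.audio_end_ms, "ms");
      break;

    case "response.audio_transcript.done":
      // Expectation: this should end roughly where playback stopped.
      // Observation: it runs a sentence or two (5-10 seconds) past it.
      console.log(
        `transcript after truncating at ${lastTruncateMs} ms:`,
        event.transcript,
      );
      break;
  }
});
```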