OpenAI Realtime + Twilio Media Streams: how to correctly truncate a live streaming response on speech interruption (barge-in)

I’m implementing a real-time voice call flow using Twilio Media Streams connected to a bidirectional WebSocket service that generates and streams audio responses back to the caller.

The call flow is roughly:

  1. Caller audio is streamed to my backend via Twilio media events (with timestamps).

  2. My backend streams synthesized audio responses back to Twilio.

  3. I also receive an event (input_audio_buffer.speech_started) that indicates the caller has started speaking while audio is still being played (barge-in).
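
For reference, the two inbound signals I key off look roughly like this (fields abbreviated; note that Twilio sends the media timestamp as a string of milliseconds since the stream started):

# Twilio media event carrying caller audio
twilio_media_event = {
    "event": "media",
    "streamSid": "MZ...",
    "media": {"track": "inbound", "timestamp": "5120", "payload": "<base64 audio>"},
}

# Realtime API event signalling barge-in (other fields omitted)
speech_started_event = {"type": "input_audio_buffer.speech_started"}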

My issue is:

I’m trying to truncate the currently playing response at the exact moment the caller starts speaking.

I do detect the interruption correctly using a speech-started event, and I immediately send a truncate instruction that includes:

  • The ID of the currently playing response

  • An audio_end_ms value computed from Twilio media timestamps

However, the truncation does not occur where the interruption actually happened. Instead, it appears to truncate the response at an earlier point in the conversation, as if the audio timeline is out of sync.

In other words:

  • The speech-started event fires at the correct moment

  • The truncate command is sent immediately

  • But the truncated audio does not match the real interruption point heard by the caller

For example:

System: Yes, I can help you place your order. Please tell me what you would like.
Customer: I want bagels and a soda, thanks.
System: That will be $9.05. Cash or credit?
(At this point, the customer interrupts.)
(Additional audio/text is generated by the system but is never sent to Twilio because the audio stream is cleared — e.g., “There is a store discount you can apply now.”)
Customer: Sorry for interrupting — do you also have coffee?
System: Sure. What would you like to order?

The issue is that, although the interruption happens after the customer states their order, the truncation lands earlier than expected and removes part of a turn the caller had already heard in full. This leaves the conversation state inconsistent and is now a critical issue in our call flow.

This is my code snippet:

# Speech interruption detected
if event_type == "input_audio_buffer.speech_started":
    # Elapsed playback time of the current response, measured on the
    # Twilio media-timestamp timeline (milliseconds)
    elapsed_ms = latest_timestamp_ms - response_start_timestamp

    await openai_ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": last_response_id,
        "content_index": 0,
        "audio_end_ms": elapsed_ms,
    }))

Timestamps are tracked from incoming Twilio media events:

if event == "media":
    # Twilio sends the timestamp as a string of milliseconds
    latest_timestamp_ms = int(media["timestamp"])
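
response_start_timestamp is captured when the first audio delta of a response is forwarded back to Twilio (this mirrors the Twilio sample the code is based on):

# First outbound audio delta of a new response: remember where on the
# Twilio media timeline playback of this response begins
if event_type in ("response.output_audio.delta", "response.audio.delta"):
    if response_start_timestamp is None:
        response_start_timestamp = latest_timestamp_ms

# It is reset to None after each completed or interrupted response, so
# the next response opens a fresh window.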

What I’m trying to understand

  1. What is the correct reference timeline for audio_end_ms?
    Should it be:

    • Based on Twilio media timestamps?

    • Based on when outbound audio was sent?

    • Based on some internal playback clock?

  2. Is it expected that speech-started events are slightly delayed relative to audio playback?
If so, how should I compensate for that delay when truncating?

  3. Is there a recommended ordering or synchronization strategy when handling:

    • Streaming outbound audio

    • Receiving inbound speech detection

    • Issuing truncate / cancel instructions

  4. Should the truncate instruction be sent immediately on speech detection, or should outbound audio first be flushed/cleared at the media-stream level?

Constraints

  • This is a live call (no buffering the full response first)

  • Truncation must reflect what the caller actually heard, not what was generated

  • I want to avoid logging or replaying partial responses incorrectly

Any guidance on how to properly synchronize interruption handling and truncation in a real-time streaming call setup would be greatly appreciated.

I did not design this logic myself; I implemented it based on a pull request published by Twilio (the "Twilio AI Interruption" sample, linked in full at the end of this post), but the issue still persists. A similar problem is discussed in an OpenAI community thread, realtime-api-interruptions-dont-properly-trim-the-transcript, which suggests this may be a known limitation or synchronization issue rather than an implementation error.

I'll be so thankful if you can help me out with this.

I am integrating Twilio Media Streams with a real-time speech generation and transcription service to handle incoming phone calls (e.g., a bakery ordering system).

The main challenge I am facing is reconstructing the correct conversation history, specifically what the caller actually heard, when interruptions (barge-in) occur.

Issue:

In a real-time call, the speech service continuously emits:

  • Audio output chunks

  • Transcript deltas for that audio

However, transcript deltas are emitted even if the audio is never fully played, or if playback is interrupted and cleared on the Twilio side.
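
Concretely, both streams arrive interleaved on the same WebSocket; a simplified version of my receive loop:

import json

async for raw in openai_ws:
    response = json.loads(raw)
    rtype = response.get("type")

    if rtype in ("response.output_audio.delta", "response.audio.delta"):
        ...  # base64 audio chunk, forwarded to Twilio for playback
    elif rtype == "response.output_audio_transcript.delta":
        ...  # text delta, emitted whether or not the audio is ever played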

When the caller interrupts, I detect this using a speech-start signal (e.g., input_audio_buffer.speech_started) and immediately:

  1. Truncate the currently playing output

  2. Clear the Twilio audio stream

  3. Resume listening to the caller

Despite this, the transcript stream continues to emit deltas for content that was generated but never heard.

As a result, the transcript history becomes incorrect and no longer reflects what the caller actually experienced.

Example

Conversation flow:

System: Yes, I can help you place your order. What would you like?
Customer: I want bagels and a soda.
System: That will be $9.05. Cash or credit?
(Customer interrupts here)
(Customer never hears the rest of the sentence)
Customer: Sorry for interrupting — do you also have coffee?
System: Sure. What would you like to order?

What actually gets generated internally (undesired)

System: That will be $9.05. Cash or credit? We also have a store discount you can apply now.

The final sentence is generated and transcribed, but never played to the caller because the stream was cleared.

What I actually want:

I am not trying to collect all generated transcript deltas.

I specifically need to reconstruct only the text corresponding to audio that was actually played to the caller by Twilio.

This includes handling cases where:

  • Output is partially played

  • Output is truncated mid-sentence

  • The caller interrupts while audio is playing

Current Code

Audio output is sent to Twilio like this:

# ---------------------------------------------
# AGENT AUDIO → TWILIO
# ---------------------------------------------
if rtype in ("response.output_audio.delta", "response.audio.delta"):
    agent_cutoff = False

    delta = response.get("delta")
    if not delta or not stream_sid:
        continue

    # Forward the base64 audio chunk to Twilio for playback
    await websocket.send_json({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": delta},
    })

    last_audio_ts = time.monotonic()

    # Remember where on the Twilio media timeline this response begins
    if response_start_timestamp_twilio is None:
        response_start_timestamp_twilio = latest_media_timestamp

    item_id = response.get("item_id")
    if item_id:
        last_assistant_item = item_id

    # Follow the chunk with a mark so playback can be acknowledged
    await send_mark(websocket, stream_sid)
    continue


Each audio chunk sent to Twilio is followed by a mark so playback can be acknowledged later.
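
For completeness, send_mark looks roughly like this (patterned on the Twilio sample; giving each mark a unique name is my own variation so transcript deltas can be keyed to a specific mark later):

import uuid

async def send_mark(connection, stream_sid):
    # Ask Twilio to echo this mark back once all audio queued before
    # it has finished playing to the caller
    if not stream_sid:
        return
    mark_id = f"part-{uuid.uuid4().hex[:8]}"
    await connection.send_json({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": mark_id},
    })
    mark_queue.append(mark_id)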

Interruption Handling

When the caller starts speaking, I detect it and immediately truncate output and clear Twilio playback:

await openai_ws.send(json.dumps({
    "type": "conversation.item.truncate",
    "item_id": last_assistant_item,
    "content_index": 0,
    "audio_end_ms": elapsed_ms,
}))

await websocket.send_json({
    "event": "clear",
    "streamSid": stream_sid,
})


elapsed_ms is computed using Twilio media timestamps.
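
where elapsed_ms is derived from the two timestamps tracked in the audio handler above:

# Playback position of the current response on the Twilio timeline
elapsed_ms = latest_media_timestamp - response_start_timestamp_twilio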

Transcript Handling

Transcript deltas arrive continuously, even if audio is later truncated or cleared.
To avoid logging text that was never played, I currently attach transcript deltas to the most recent unacknowledged mark:



# ---------------------------------------------
# OUTPUT TRANSCRIPT DELTA
# ---------------------------------------------
if rtype == "response.output_audio_transcript.delta":
    if agent_cutoff or not mark_queue:
        continue

    delta = response.get("delta")
    if not delta:
        continue

    # Attach the text to the most recent unacknowledged mark
    active_mark_id = mark_queue[-1]
    pending_agent_transcripts.setdefault(active_mark_id, []).append(delta)
    continue



Despite this, transcript deltas continue to arrive for audio that was never played.

Question

What is the correct or recommended way to reconstruct a reliable conversation transcript in this scenario? How can I reliably determine which transcript segments correspond to audio actually played?

I am following an official Twilio example that uses playback markers and interruption detection, but the issue persists. My code is based on this PR: Initial add AI interruption/conversation truncation. AI talks first. by pkamp3 · Pull Request #13 · twilio-samples/speech-assistant-openai-realtime-api-python · GitHub

Thank you so much!