## Why this is needed
A real-time voice agent must keep live captions in exact lock-step with the audio it streams.
Start-times are already derivable from the audio stream (each frame carries a timestamp or can be placed monotonically in the playback queue).
The missing piece is the true duration of each transcript delta.
Today that duration is guessed (e.g., `70 ms × chars`), which drifts whenever the engine changes speed.
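For concreteness, this is roughly the guess clients have to make today (a sketch; the 70 ms-per-character constant is only illustrative):

```python
def guess_duration_ms(text: str, ms_per_char: int = 70) -> int:
    """Estimate how long the engine will take to speak `text`.
    A fixed per-character cost ignores speaking rate, pauses and voice,
    so the caption timeline slowly drifts away from the real audio."""
    return ms_per_char * len(text)

# "Hello there!" -> 12 characters -> 840 ms estimated,
# while the engine might actually take ~640 ms to say it.
```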
## Minimal API change
```jsonc
{
  "type": "response.audio_transcript.delta",
  "item_id": "itm_42",
  "content_index": 0,
  "delta": "Hello there!",
  "duration_ms": 640   // <-- NEW, integer
}
```
- Only one extra integer, fully backward-compatible (see the fallback sketch below).
- No start/offset field is required; the client derives that from its audio timeline.
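A client can adopt the field opportunistically. A minimal sketch, assuming the event arrives as a dict shaped like the example above (the helper name and fallback are hypothetical, not part of the proposal):

```python
def delta_duration_ms(event: dict) -> int:
    """Return the caption duration for one transcript-delta event.

    Prefers the proposed `duration_ms` field and falls back to the
    old per-character guess when the server does not send it yet.
    """
    if "duration_ms" in event:
        return event["duration_ms"]            # exact timing from the model
    return 70 * len(event.get("delta", ""))    # legacy heuristic (drifts)
```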
## How the client would use `duration_ms`
```python
from dataclasses import dataclass
from typing import Optional

# ───────────── shared state ─────────────
captions_timeline = []        # ordered Periods, no overlaps
current_stream_offset = 0     # ms, head of the caption timeline
item_base_offsets = {}        # first-audio offset per item

@dataclass
class Period:
    kind: str                 # "caption" or "silence"
    item_id: Optional[str]
    index: Optional[int]
    start: int                # ms from the start of the audio stream
    duration: int             # ms

def render_caption(text, start, duration):
    """UI hook: show `text` from `start` for `duration` ms (app-specific)."""
    print(f"[{start:>6} ms +{duration} ms] {text}")

# ───────────── audio handler ─────────────
def on_first_audio_frame(item_id, frame_start_ms):
    # Remember where this conversation item begins in the stream
    item_base_offsets[item_id] = frame_start_ms

# ───────── transcription-delta handler ─────────
def on_delta(item_id, index_in_item, text, duration_ms):
    global current_stream_offset
    base = item_base_offsets[item_id]               # already known
    if index_in_item == 0 and current_stream_offset < base:
        # Add silence before the first caption of the item
        captions_timeline.append(
            Period("silence", None, None,
                   current_stream_offset,
                   base - current_stream_offset))
        current_stream_offset = base
    captions_timeline.append(
        Period("caption", item_id, index_in_item,
               current_stream_offset, duration_ms))
    current_stream_offset += duration_ms
    render_caption(text, start=current_stream_offset - duration_ms,
                   duration=duration_ms)

# ─────────── truncation (seek / lost frames) ───────────
def truncate_to(playback_offset_ms):
    global captions_timeline, current_stream_offset
    # Drop every period that starts at or after the already-played point
    captions_timeline = [p for p in captions_timeline
                         if p.start < playback_offset_ms]
    current_stream_offset = sum(p.duration for p in captions_timeline)
```
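For illustration, a hypothetical run through the handlers above (item ID and timings invented):

```python
on_first_audio_frame("itm_42", frame_start_ms=1200)
on_delta("itm_42", 0, "Hello there!", duration_ms=640)
on_delta("itm_42", 1, " How can I help?", duration_ms=910)
# captions_timeline now holds three contiguous periods:
#   silence              0 -> 1200 ms
#   caption itm_42[0]    1200 -> 1840 ms
#   caption itm_42[1]    1840 -> 2750 ms

truncate_to(1840)  # audio cut at 1840 ms: the not-yet-played caption is dropped
```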
## Key properties
- Timeline is strictly monotonic; captions never overlap (see the check below).
- If playback jumps backward, `truncate_to()` simply chops the tail and replay resumes.
- With `duration_ms` from the API, the algorithm is deterministic: no per-voice heuristics, no drift.
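These invariants are cheap to verify at runtime; a hypothetical check over the `Period` list built in the sketch above:

```python
def assert_timeline_invariants(periods):
    """Each period must start exactly where the previous one ended."""
    for prev, cur in zip(periods, periods[1:]):
        assert cur.start == prev.start + prev.duration, "gap or overlap"
    assert all(p.duration >= 0 for p in periods), "negative duration"
```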
## Summary
Adding a single `duration_ms` field to each `response.audio_transcript.delta` server event lets any client maintain an exact, non-overlapping caption timeline driven directly by the model's own timing. This eliminates heuristic duration guesses and guarantees that captions remain perfectly aligned to the audio, even across speed changes or stream truncations.
Thank you for considering this improvement.