[feature-request] Add duration_ms to Realtime API conversation transcript deltas

Why this is needed

A real-time voice agent must keep live captions in exact lock-step with the audio it streams.
Start-times are already derivable from the audio stream (each frame carries a timestamp or can be placed monotonically in the playback queue).
The missing piece is the true duration of each transcript delta.
Today that duration is guessed (e.g., “70 ms × chars”), which drifts whenever the engine changes speed.
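To make the drift concrete, compare the heuristic with the sample delta used later in this post (“Hello there!”, which the engine actually speaks in 640 ms in the example event):

# Heuristic vs. actual duration for the sample delta below
HEURISTIC_MS_PER_CHAR = 70
delta_text = "Hello there!"                               # 12 characters
estimated_ms = HEURISTIC_MS_PER_CHAR * len(delta_text)    # 840 ms
actual_ms = 640                                           # from the sample event below
error_ms = estimated_ms - actual_ms                       # 200 ms off on one delta,
                                                          # and the error compounds per delta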


Minimal API change

{
  "type": "response.audio_transcript.delta",
  "item_id": "itm_42",
  "content_index": 0,
  "delta": "Hello there!",
  "duration_ms": 640        // <-- NEW, integer
}
  • Only one extra integer, fully backward-compatible (a fallback sketch follows this list).
  • No start/offset field required—the client derives that from its audio timeline.
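Because the field would be optional, a client can prefer it and fall back to today’s heuristic when it is absent. A minimal sketch, using the event keys from the example above (70 ms/char is the current guess, not part of the API):

def delta_duration_ms(event):
    # Prefer the proposed field; otherwise fall back to the drift-prone character heuristic
    if "duration_ms" in event:
        return event["duration_ms"]
    return 70 * len(event["delta"])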

How the client would use duration_ms

from dataclasses import dataclass
from typing import Optional

# ───────────── shared state ─────────────
captions_timeline = []               # ordered Period objects, contiguous, no overlaps
current_stream_offset = 0            # ms, head of the caption timeline
item_base_offsets = {}               # first-audio offset per item_id

@dataclass
class Period:
    kind: str                        # "caption" or "silence"
    item_id: Optional[str]
    index: Optional[int]
    start: int                       # ms from the start of the output stream
    duration: int                    # ms

# ───────────── audio handler ─────────────
def on_first_audio_frame(item_id, frame_start_ms):
    # Remember where this conversation item begins in the stream
    item_base_offsets[item_id] = frame_start_ms

# ───────── transcription-delta handler ─────────
def on_delta(item_id, index_in_item, text, duration_ms):
    global current_stream_offset
    base = item_base_offsets[item_id]          # already known
    if index_in_item == 0 and current_stream_offset < base:
        # Add silence before the first caption of the item
        captions_timeline.append(
            Period("silence", None, None,
                   current_stream_offset,
                   base - current_stream_offset)
        )
        current_stream_offset = base

    start = current_stream_offset
    captions_timeline.append(
        Period("caption", item_id, index_in_item, start, duration_ms)
    )
    current_stream_offset = start + duration_ms
    render_caption(text, start=start, duration=duration_ms)   # app's own UI hook

# ─────────── truncation (seek / lost frames) ───────────
def truncate_to(playback_offset_ms):
    global captions_timeline, current_stream_offset
    # Drop every period that starts at or after the already-played point
    captions_timeline = [p for p in captions_timeline
                         if p.start < playback_offset_ms]
    # Periods are contiguous from 0, so the remaining durations sum to the new head
    current_stream_offset = sum(p.duration for p in captions_timeline)
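For completeness, a hypothetical wiring of these handlers to the server event stream. The event shape follows the example above, duration_ms is assumed to exist as proposed, and audio_player is a stand-in for whatever playback component the app uses:

def handle_server_event(event, audio_player):
    # Dispatch transcript deltas; first-audio offsets come from the playback path,
    # e.g. call on_first_audio_frame(item_id, audio_player.position_ms()) when the
    # first frame of an item is enqueued (position_ms is a hypothetical player API).
    if event["type"] == "response.audio_transcript.delta":
        on_delta(event["item_id"],
                 event["content_index"],
                 event["delta"],
                 event["duration_ms"])        # proposed field

def on_seek_or_frame_loss(audio_player):
    # Resynchronise captions to what has actually been played
    truncate_to(audio_player.position_ms())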

Key properties

  • Timeline is strictly monotonic; captions never overlap (a quick check of this invariant follows the list).
  • If playback jumps backward, truncate_to() simply chops the tail and replay resumes.
  • With duration_ms from the API, the algorithm is deterministic—no per-voice heuristics, no drift.
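A quick sanity check of that no-overlap property, runnable against the sketch above (hypothetical helper, not part of the proposal):

def assert_timeline_contiguous(timeline):
    # Every period must start exactly where the previous one ended
    expected_start = 0
    for p in timeline:
        assert p.start == expected_start, "gap or overlap in caption timeline"
        expected_start = p.start + p.duration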

Summary

Adding a single duration_ms field to each response.audio_transcript.delta server event lets any client maintain an exact, non-overlapping caption timeline driven directly by the model’s own timing. This eliminates heuristic duration guesses and guarantees that captions remain perfectly aligned to the audio, even across speed changes or stream truncations.

Thank you for considering this improvement.


This is a high-impact improvement that unlocks a major leap in caption accuracy for any real-time voice agent.

The current heuristic-based duration estimates (like 70ms × chars) simply can’t keep up with real-world variations, especially when engine speed shifts or the transcript structure changes. These tiny drifts compound over time, and the end result is captions that feel subtly but noticeably “off,” which breaks the immersive experience we strive for.

Adding a single duration_ms field derived from the same internal timing already available to the engine gives developers deterministic, drift-free control over caption timing. This enables perfect alignment between transcript and audio, across speed shifts, seek events, or packet loss. The simplicity and backward compatibility of the proposal make it even more compelling: just one integer unlocks a whole new level of precision.

This change would dramatically raise the ceiling on what’s possible in real-time captioning and accessibility.

I absolutely need this too.
The current workaround using character-based duration estimates just doesn’t hold up; timing drifts fast with engine speed changes or even slight transcript shifts.

A proper duration_ms field would let us finally get accurate, drift-free captions in real-time.
It’s such a small change but would make a huge difference.
Really hoping this gets prioritized.

Bringing this issue to the forefront again: incorporating duration metadata into real-time transcription would significantly enhance its value, especially for captioning workflows. Without this crucial feature, delivering accurate, user-friendly captions becomes a real challenge. Addressing this need could unlock powerful new use cases and greatly improve developer adoption and satisfaction. Let’s make real-time transcription truly seamless and impactful together!