Hi everyone — I’m building a low-latency training app using OpenAI Realtime over WebRTC and wanted to confirm the best practice for exporting audio at the end of a session.
Context / Architecture (short):
- Browser connects to Realtime via WebRTC (mic in, assistant audio out; events on the data channel).
- We update instructions dynamically during the session (and also push a plan + reflection context mid-conversation).
- For recording, we currently capture a stereo stream in the browser (mic → left, assistant → right) via a Web Audio graph and MediaRecorder with 5–10 s timeslices, upload chunks to S3, and merge server-side at wrap-up.
- We also mirror utterance metadata (role, text, t_start/t_end) to our backend for timestamped transcripts and caption files (VTT/SRT).
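For reference, the caption side of this is a small pure function over that mirrored metadata. A minimal sketch (the field names `role`, `text`, `tStart`, `tEnd` are assumptions about our payload shape, with times in seconds):

```javascript
// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function vttTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, w) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms % 1000, 3)}`;
}

// Turn mirrored utterance records into a WebVTT file, using <v Speaker>
// voice tags so the user/assistant channels stay distinguishable.
function toVtt(utterances) {
  const cues = utterances.map(
    (u) => `${vttTime(u.tStart)} --> ${vttTime(u.tEnd)}\n<v ${u.role}>${u.text}`
  );
  return "WEBVTT\n\n" + cues.join("\n\n") + "\n";
}
```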
What I’m trying to clarify:
- Does Realtime expose a server-side "final audio" artifact (a single downloadable file) at the end of a session, or is streaming (the WebRTC track / `response.audio.delta` over WS) the only intended pattern?
- If there's no built-in final file, is the recommended approach to continue doing client-side recording (and server-side merge), or is there an official alternative (e.g., an export job, a server-side recorder, or a sample that writes the assistant audio to object storage)?
- Any guidance on formats you recommend for long sessions (e.g., WebM/Opus vs. WAV/PCM)? Especially for later concatenation without re-encoding and for accurate lip-sync/captions.
- Are there events we should use to delimit utterances for accurate segment boundaries (e.g., end-of-turn vs. end-of-audio), so our transcript timing lines up with the audio reliably?
- For long sessions (since Realtime sessions time out), is there a recommended pattern for session renewal/rollover that keeps audio and transcript timelines continuous, or an upcoming feature that would make this easier?
- Lastly, any policy/licensing notes we should keep in mind when storing a combined user+assistant recording (e.g., diarization channels, retention)?
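On the concatenation question: our current server-side merge relies on the fact that timeslice chunks from a single MediaRecorder session form one continuous WebM stream (only the first blob carries the container header), so byte-concatenation in upload order is enough. A sketch of that assumption (the `"sess/000001.webm"` key scheme is our own convention, not anything from the API; chunks from *different* recorder sessions each start a fresh header and would need a real re-mux, e.g. ffmpeg):

```javascript
// Merge ordered timeslice chunks from ONE MediaRecorder session.
// Keys are zero-padded so a lexicographic sort is also a numeric sort.
function mergeSessionChunks(objects) {
  const ordered = [...objects].sort((a, b) => a.key.localeCompare(b.key));
  // Byte-concatenate: only the first timeslice blob carries the
  // WebM/EBML header; later blobs continue the same stream.
  return Buffer.concat(ordered.map((o) => o.body));
}
```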
Why I’m asking:
We can keep using the browser-side timeslice approach, but before we lock it in, I wanted to check whether there’s a documented or upcoming feature to download/export a single audio file directly from the Realtime session (or an officially supported server-side recipe).
Environment:
- Client: WebRTC (Chrome), Web Audio + MediaRecorder (5–10 s timeslice), RTCDataChannel for events
- Server: Node/Fastify for signaling + ingest; S3 for chunks and the final asset; Lambda for merge + VTT/SRT
- Use case: sales role-play training; dynamic instruction updates; plan/reflection context
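For the rollover question, the pattern we've been considering is to anchor every event to a single epoch (the first session's wall-clock start) rather than to each session's own clock, so timestamps stay monotonic across renewals. A rough sketch of that bookkeeping (all names here are illustrative, not part of the Realtime API):

```javascript
// Keep one continuous recording timeline across session rollovers by
// mapping each session's relative times onto a shared global clock.
class ContinuousTimeline {
  constructor(epochMs) {
    this.epochMs = epochMs;        // wall-clock start of the whole recording
    this.sessionStartMs = epochMs; // wall-clock start of the current session
  }
  // Call when a renewed session begins.
  rollover(newSessionStartMs) {
    this.sessionStartMs = newSessionStartMs;
  }
  // Map a session-relative time (seconds) to global recording time.
  toGlobalSeconds(sessionRelativeSeconds) {
    return (this.sessionStartMs - this.epochMs) / 1000 + sessionRelativeSeconds;
  }
}
```

Whether this is the right shape depends on the answer to the renewal question above; if there's an official rollover recipe, we'd rather follow that.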
Thanks in advance! If there’s a canonical doc or sample that covers this (or a roadmap item), a link would be super helpful.