Realtime API — Is there a built-in way to get a downloadable “final audio” file for a session?

Hi everyone — I’m building a low-latency training app using OpenAI Realtime over WebRTC and wanted to confirm the best practice for exporting audio at the end of a session.

Context / Architecture (short):

  • Browser connects to Realtime via WebRTC (mic in, assistant audio out; events on the data channel).

  • We update instructions dynamically during the session (and also push a plan + reflection context mid-conversation).

  • For recording, we currently capture a stereo stream in the browser (Mic → Left, Assistant → Right) via a Web Audio graph and MediaRecorder with 5–10s timeslices, upload chunks to S3, and merge server-side at wrap-up.

  • We also mirror utterance metadata (role, text, t_start/t_end) to our backend for timestamped transcripts and caption files (VTT/SRT).
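For anyone replicating the channel layout above: once both tracks are decoded to mono PCM at a shared sample rate, the Mic → Left / Assistant → Right mapping is just sample interleaving. A minimal server-side sketch (the function name is ours, not from any SDK; in the browser we do the equivalent inside the Web Audio graph):

```javascript
// Interleave two mono Float32Array PCM buffers into one stereo buffer:
// mic on the left channel, assistant audio on the right.
// Assumes both inputs are already resampled to a common sample rate.
function interleaveStereo(micMono, assistantMono) {
  const frames = Math.max(micMono.length, assistantMono.length);
  const stereo = new Float32Array(frames * 2);
  for (let i = 0; i < frames; i++) {
    stereo[2 * i] = micMono[i] ?? 0;           // L = mic (pad with silence)
    stereo[2 * i + 1] = assistantMono[i] ?? 0; // R = assistant
  }
  return stereo;
}
```

The shorter track is padded with silence so both channels stay the same length.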

What I’m trying to clarify:

  1. Does Realtime expose a server-side “final audio” artifact (single downloadable file) at end of session, or is streaming only (WebRTC track / response.audio.delta over WS) the intended pattern?

  2. If there’s no built-in final file, is the recommended approach to continue doing client-side recording (and server-side merge), or is there an official alternative (e.g., an export job, server-side recorder, or a sample that writes the assistant audio to object storage)?

  3. Any guidance on formats you recommend for long sessions? (e.g., WebM/Opus vs WAV/PCM) — especially for later concatenation without re-encoding and for accurate lip-sync/captions.

  4. Are there events we should use to delimit utterances for accurate segment boundaries (e.g., end-of-turn vs. end-of-audio), so our transcript timing lines up with the audio reliably?

  5. For long sessions (since Realtime sessions time out), is there a recommended pattern for session renewal/rollover that keeps audio and transcript timelines continuous, or any upcoming feature that would make this easier?

  6. Lastly, any policy/licensing notes we should keep in mind when storing a combined user+assistant recording (e.g., diarization channels, retention)?
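For context on (2) and (3): because all timeslice chunks come from a single MediaRecorder, they form one continuous WebM bitstream (only the first chunk carries the container header), so our server-side merge is a byte-level append in sequence order; concatenating independently recorded files would need a re-mux instead. A sketch of the merge step (the chunk shape is our own convention, not an API):

```javascript
// Merge MediaRecorder timeslice chunks back into one WebM file.
// Valid only for chunks from a single MediaRecorder session, which
// form one continuous bitstream; sort by sequence number and append.
function mergeChunks(chunks /* Array<{seq: number, data: Buffer}> */) {
  return Buffer.concat(
    chunks
      .slice() // don't mutate the caller's array
      .sort((a, b) => a.seq - b.seq)
      .map((c) => c.data)
  );
}
```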

Why I’m asking:
We can keep using the browser-side timeslice approach, but before we lock it in, I wanted to check whether there’s a documented or upcoming feature to download/export a single audio file directly from the Realtime session (or an officially supported server-side recipe).
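To frame question 5 a bit more: today we keep the timeline continuous across session renewals by storing, per session segment, the offset at which it began relative to the start of the recording, and remapping each utterance's session-local times onto that global timeline. A minimal sketch (names are illustrative, not from any API):

```javascript
// Remap an utterance's session-local time range onto the global
// recording timeline across session rollovers. `segments[i].startOffset`
// is the offset (seconds) at which renewed session i began, relative
// to the start of the whole recording.
function toGlobalTime(segments, sessionIndex, tStart, tEnd) {
  const offset = segments[sessionIndex].startOffset;
  return { t_start: offset + tStart, t_end: offset + tEnd };
}
```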

Environment:

  • Client: WebRTC (Chrome), Web Audio + MediaRecorder (timeslice 5–10s), RTCDataChannel for events

  • Server: Node/Fastify for signaling + ingest; S3 for chunks and final asset; Lambda for merge + VTT/SRT

  • Use case: Sales role-play training; dynamic instruction updates; plan/reflection context
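For completeness, the VTT step in the Lambda is mechanical once the utterance metadata (role, text, t_start/t_end) is mirrored; a minimal sketch, assuming times in seconds:

```javascript
// Format seconds as a WebVTT timestamp (HH:MM:SS.mmm).
function secondsToVtt(t) {
  const h = String(Math.floor(t / 3600)).padStart(2, '0');
  const m = String(Math.floor((t % 3600) / 60)).padStart(2, '0');
  const s = (t % 60).toFixed(3).padStart(6, '0');
  return `${h}:${m}:${s}`;
}

// Render mirrored utterances as a WebVTT caption file, using a
// voice span (<v role>) to carry speaker diarization.
function toVtt(utterances) {
  const cues = utterances.map(
    (u, i) =>
      `${i + 1}\n${secondsToVtt(u.t_start)} --> ${secondsToVtt(u.t_end)}\n` +
      `<v ${u.role}>${u.text}`
  );
  return `WEBVTT\n\n${cues.join('\n\n')}\n`;
}
```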

Thanks in advance! If there’s a canonical doc or sample that covers this (or a roadmap item), a link would be super helpful.


This is definitely needed. With the WebRTC connection happening entirely in the user’s browser, there’s no server-side way to maintain conversation records.

Transcription isn’t accurate enough to be sure what the user actually said, and you have to have the user’s browser send transcript records to an endpoint, because the session is a direct connection between the browser and the API.

As for audio, expecting the user’s browser to upload the recording is fraught with issues: it can slow the user’s connection down, or fail outright because an extension or network policy blocks the upload.

We need a way to just get a stored audio record of the conversation. OpenAI could charge for it at the normal storage rate.