response.output_audio.delta never gets sent via WebRTC or WebSocket

I’m getting all the other realtime events firing in both the WebRTC mobile-app client and the WebSocket sideband backend, but response.output_audio.delta never fires on either.

Presumably this should fire at least once (likely a few times) every time the assistant speaks?

Correct. Just to double-check a few things: you have an audio output modality, you’re using an audio-capable model, and you’ve specified a valid voice and output format?

and are you seeing the response.audio_output.done?

I see response.output_audio.done yes. @mcfinley

Yes, sorry, I transposed that. OK, confirm also:

session.update → session → type = realtime

you’re not looking at the beta delta events (which were response.audio.delta)

you’re not getting any response.refusal messages

you don’t have anything limiting the message size on the websocket

the .done messages are coming through after some delay (one or two seconds, as though the model is streaming deltas before the done) rather than immediately after your response.create

finally, your output response.create looks something like this (important values marked, your use case may have other values):

            "type": "response.create",
            "response": {
                "conversation":"auto",           # important
                "instructions": "tell me a story",
                "max_output_tokens": 150,        # important
                "output_modalities": ["audio"],  # important
                "audio": {
                    "output": {
                        "format": {
                            "type": "audio/pcm", # important
                            "rate": 24000        # important
                        },
                        "voice": "marin",        # important
                    }
                }
            }

const resp = await openaiClient.realtime.clientSecrets.create({
  expires_after: { anchor: "created_at", seconds: 600 },
  session: {
    audio: {
      output: {
        format: {
          rate: 24000,
          type: "audio/pcm",
        },
        speed: 1,
        voice: "alloy",
      },
    },
    instructions: prompt,
    max_output_tokens: 4096,
    model: "gpt-realtime",
    output_modalities: ["audio"],
    type: "realtime",
  },
});

This is what I’m doing to set up the session, using all the fields you mentioned.

I also tried it with an explicit response.create like you shared above.

The .done is coming through 1–2 seconds later, followed even later by output_audio_buffer.stopped. @mcfinley

are you getting response.output_audio_transcript.delta?

The audio is flowing through a separate channel… look at https://platform.openai.com/docs/guides/realtime-conversations#client-and-server-events-for-audio-in-webrtc

I get the audio delta events when I’m using straight websockets, not WebRTC.

If you need access to the audio data blocks, it’s WebSockets. From the docs:

Manipulating WebRTC APIs for media streams may give you all the control you need. However, it may occasionally be necessary to use lower-level interfaces for audio input and output. Refer to the WebSockets section below for more information and a listing of events required for granular audio input handling.

Yep, I get dozens of those.

The backend is connected via websockets. The frontend via WebRTC.

For audio in WebSockets, the server events section (https://platform.openai.com/docs/guides/realtime-conversations?lang=javascript#handling-audio-with-websockets) lists “response.output_audio.delta” as an event we should get. I get all the other events in that list, just not this one.

Lower down (https://platform.openai.com/docs/guides/realtime-conversations#working-with-audio-output-from-a-websocket) it also mentions this event as the way to get access to audio data (in order to save it to S3 in my case).
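For reference, here’s a minimal sketch of what I’d expect that archival path to look like on the WebSocket side, assuming 24 kHz mono PCM16 output (matching the `audio/pcm` session format above). The function name and the upload step are my own, not from the docs:

```javascript
// Sketch: collect the base64 payloads from response.output_audio.delta
// events and wrap the raw PCM16 in a WAV header so it can be archived.
function makeWavFromDeltas(base64Deltas, sampleRate = 24000) {
  const pcm = Buffer.concat(base64Deltas.map((d) => Buffer.from(d, "base64")));
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format: PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);     // sample rate
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}
```

The idea being: push `event.delta` on every `response.output_audio.delta`, then build the WAV on `response.output_audio.done` and upload it (to S3, in my case).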

I’m getting audio deltas on server-to-server (headless) audio websocket-only connections.
From what I can tell, you have a hybrid use case: you want a client to receive the audio for playback on the WebRTC peer track AND you want the server to receive it for archival.

Sorry if you know all of this and I’m just out of my depth, but if you are connecting to the Realtime API using WebRTC from a client device, audio output from the model is delivered to your client as a remote media stream (per the WebRTC MediaStream tutorial).

The browser infra is doing all the signal processing for you on purpose, so you don’t have to. But in your case, you need something more.

You are getting the control signals that tell you what’s happening (.stopped, .done) on the peer connection so you can coordinate with the UI, but you are not getting audio deltas because they are handled by the media track on purpose.

So I think what you want is to peek at the media track as it flows on the client and then push it to your server (or straight to S3). This is supported by WebRTC (not OpenAI) on the browser side, for applications like showing an analyzer or offering a record button:

const mediaRecorder = new MediaRecorder(remoteStream);
const audioChunks = [];

mediaRecorder.ondataavailable = event => {
    audioChunks.push(event.data);
};

mediaRecorder.onstop = () => {
    const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });
    // Process audioBlob, e.g., send to server, play back, or decode for analysis
};

mediaRecorder.start();
// ... later ...
mediaRecorder.stop();

Sorry if unhelpful… that’s the bottom of my knowledge without more research!

Appreciate it. Sounds like, due to the hybrid approach, I won’t get access to those audio deltas on the WebSocket side. Unfortunate, but fair enough.

Thank you for helping me dig into this!

@samwilcoxon, in case you are still grappling with this, I was hitting the same problem, and it looks like the audio delta event changed from

response.output_audio.delta

to

response.audio.delta

at some point (possibly with the change to gpt-realtime?). Switching to the latter fixed my problem. (I’m using WebSocket, not WebRTC.) I’d be interested to know if it does for you, too.

This change doesn’t seem to be reflected in the docs, unfortunately.

Oh, weird. Looks like I actually made the opposite change in my application just a month ago, going from audio to output_audio. I wonder if that change got reverted on OpenAI’s end (without a corresponding revert to the docs).