[BUG?] SIP Realtime – distorted or missing audio (noise/static) with `gpt-realtime` and `gpt-realtime-mini`

Over the past few days, SIP Realtime calls using either gpt-realtime-mini or gpt-realtime have started producing either no assistant audio or distorted/static audio.
I’m not sure of the exact day this began, but the same setup was working perfectly a few weeks ago with the exact same code and configuration.

The model clearly generates speech internally — logs show response.output_audio_transcript.delta and response.output_audio.done — but the RTP audio that arrives over SIP is just noise or silence.
Both PCMU and PCMA codecs show the same issue, with different voices and ASR models.


Environment

  • Date / Region: October 2025 – South America (Brazil)
  • Models: gpt-realtime-mini, gpt-realtime
  • Integration: SIP Realtime (POST /v1/realtime/calls/{id}/accept + WebSocket)
  • Symptom: The agent generates speech internally, but the RTP stream carries only static or silence.

Steps to Reproduce

  1. Create a SIP Realtime call and send a basic call.accept payload.
  2. Open the WebSocket and send a session.update with one of the payloads below (this flow is sketched just after the list).
  3. The assistant responds normally (transcripts appear), but the SIP side receives either no audio or continuous noise.
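
For context, here is a minimal sketch of that flow in TypeScript/Node (using the ws package and fetch). It is not the actual production code; the accept endpoint path comes from this report, while the WebSocket URL keyed by call_id and the header usage are my assumptions based on the SIP Realtime docs.

import WebSocket from 'ws';

const apiKey = process.env.OPENAI_API_KEY!;
const callId = 'rtc_...'; // call ID from the incoming-call webhook

async function handleCall(sessionUpdate: object) {
  // Step 1: accept the SIP call with a basic call.accept payload.
  await fetch(`https://api.openai.com/v1/realtime/calls/${callId}/accept`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      type: 'realtime',
      model: 'gpt-realtime-mini',
      output_modalities: ['audio'],
    }),
  });

  // Step 2: open the WebSocket for this call and send one of the
  // session.update payloads shown in the payload section below.
  const ws = new WebSocket(`wss://api.openai.com/v1/realtime?call_id=${callId}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  ws.on('open', () => ws.send(JSON.stringify(sessionUpdate)));

  // Step 3: the transcript/audio events still appear even when the RTP audio is broken.
  ws.on('message', (raw) => {
    const event = JSON.parse(raw.toString());
    if (event.type === 'response.output_audio_transcript.delta' ||
        event.type === 'response.output_audio.done') {
      console.log(event.type);
    }
  });
}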

Expected Behavior
Assistant audio should arrive clearly over RTP, matching the generated transcript.

Actual Behavior
The RTP packets are sent and received normally; traffic flows in both directions with payloads of roughly 170 bytes, consistent with 20 ms G.711 frames (160 audio bytes plus the 12-byte RTP header). The decoded audio, however, is just static or silence.
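
To tell true silence from mangled samples in a capture, one quick check is to decode the µ-law (PCMU) payload of a few packets to linear PCM and look at the RMS level: near-zero RMS means digital silence was actually sent, while a high RMS with no intelligible speech points to corrupted samples. The sketch below is my own diagnostic, not part of the original report; it assumes payload is the raw 160-byte RTP payload extracted from a packet.

// G.711 µ-law expansion (ITU-T G.711) from one encoded byte to a 16-bit sample.
function ulawToLinear(uVal: number): number {
  const inverted = ~uVal & 0xff;
  let t = ((inverted & 0x0f) << 3) + 0x84;
  t <<= (inverted & 0x70) >> 4;
  return inverted & 0x80 ? 0x84 - t : t - 0x84;
}

// RMS of a decoded payload: near zero means digital silence,
// a large value with no intelligible speech points to garbled samples.
function rmsOfUlawPayload(payload: Uint8Array): number {
  let sumSquares = 0;
  for (const byte of payload) {
    const sample = ulawToLinear(byte);
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / payload.length);
}

// Example: 0xFF is the µ-law code for zero, so an all-0xFF payload is pure silence.
const silent = new Uint8Array(160).fill(0xff);
console.log('RMS of silent payload:', rmsOfUlawPayload(silent)); // 0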


Minimal Payloads (redacted)

call.accept
{
  "type": "realtime",
  "model": "gpt-realtime-mini",
  "output_modalities": ["audio"]
}

session.update – PCMA + gpt-4o-mini-transcribe
{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "audio": {
      "output": { "voice": "alloy", "format": { "type": "audio/pcma" } },
      "input": {
        "format": { "type": "audio/pcma" },
        "turn_detection": { "type": "semantic_vad", "eagerness": "low" },
        "transcription": {
          "language": "pt",
          "model": "gpt-4o-mini-transcribe",
          "prompt": ""
        }
      }
    },
    "max_output_tokens": 4096
  }
}

session.update – PCMU + whisper-1
{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "model": "gpt-realtime-mini",
    "audio": {
      "output": { "voice": "ash", "format": { "type": "audio/pcmu" } },
      "input": {
        "format": { "type": "audio/pcmu" },
        "turn_detection": { "type": "semantic_vad", "eagerness": "low" },
        "transcription": {
          "language": "pt",
          "model": "whisper-1",
          "prompt": ""
        }
      }
    },
    "max_output_tokens": 4096
  }
}


Logs (summary)

  • The assistant generates responses normally — response.output_audio_transcript.delta and response.output_audio.done appear.
  • SIP receives RTP packets (around 170 bytes each) going to and from OpenAI, but the audio is static or silent.
  • Example call ID: rtc_0772b78b39be4de4…
  • accept → 200 OK, session.update → applied successfully.

Tests Already Done

  • Switched between PCMU and PCMA (same result).
  • Tested different voices (alloy, ash).
  • Tested different ASR models (gpt-4o-mini-transcribe, whisper-1).
  • Verified ptime = 20 ms, no transcoding, pure µ-law/A-law RTP path.
  • Behavior is intermittent but happens often.

Request
If there have been any recent changes to the SIP Realtime backend, could you please let me know?
The same configuration was working fine not long ago, and nothing has changed on my side.

I would really appreciate any suggestions or workarounds (like codec or ptime configuration) that might help restore normal audio output.

Thanks.

WOW. I thought I was the only one! Can you take a look at this, @juberti?

Did something change within the past few days? Sharing my setup for reference:
Using PSTN → SignalWire → OpenAI

This is my payload:

const payload = {
  type: 'realtime',
  model: 'gpt-realtime',
  audio: {
    input: {
      transcription: {
        language: 'en',
        model: 'whisper-1',
      },
      turn_detection: {
        type: 'semantic_vad'
      },
      format: 'audio/pcmu'
    },
    output: {
      voice: 'cedar',
      format: 'audio/pcmu'
    }
  },
  instructions: getUnconditionalPrompt(),
  include: ['item.input_audio_transcription.logprobs'],
  tracing: 'auto',
  tool_choice: 'auto',
  tools: [
    {
      type: 'function',
      name: 'make_screening_decision',
      description: 'Make final screening decision after gathering information from caller',
      parameters: {
        type: 'object',
        properties: {
          decision: {
            type: 'string',
            enum: ['ALLOW', 'REJECT', 'VOICEMAIL'],
            description: 'The screening decision',
          },
          reason: {
            type: 'string',
            description: 'Reason for the decision',
          },
        },
        required: ['decision', 'reason'],
      },
    },
  ]
};

@aabreu Are you seeing any logs showing that OpenAI is receiving the input audio? I’m not getting any input or output audio, but I am seeing that the output transcription succeeds.

Hi @josh31. Most of the time no — I also notice that the input audio is missing. The issue is intermittent, and the behavior is quite confusing. Like you, I can see the output transcription being generated, but the received audio is inaudible.

I’ve also had cases where I was speaking, and in the output transcription I saw messages like:

“Your call has a lot of noise, could you repeat that?”.

Interesting, this seems like a slightly different issue than mine… sounds like a codec problem.
Posting my full issue here: [BUG] SIP Realtime API - No Audio Output, Phantom Audio Input (Broken Oct 18-22, 2025) @juberti

Don’t set format; it’s not needed when using WebRTC/SIP. We should be ignoring this parameter, but evidently in some cases it’s slipping through.
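
For reference, the first session.update from earlier in the thread would then look like this with the format fields removed (the same payload minus audio.input.format and audio.output.format; not an official example):

{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "audio": {
      "output": { "voice": "alloy" },
      "input": {
        "turn_detection": { "type": "semantic_vad", "eagerness": "low" },
        "transcription": {
          "language": "pt",
          "model": "gpt-4o-mini-transcribe",
          "prompt": ""
        }
      }
    },
    "max_output_tokens": 4096
  }
}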

Hi @juberti,

We are reproducing this issue.

We also do not set the format, as you suggested to the authors earlier, but the audio is still at times distorted.

So we make a call, and OpenAI answers distorted and slow.

We drop and redial the same call, same network, same everything, and OpenAI answers perfectly clearly.

Attached are two captures where everything is constant (environment, devices, and API key); the calls were made one after the other: dump_ok.pcap and dump_notok.pcap.

(Right-click on a packet > Decode As… > RTP, then play the stream.)

This is the URL:

https://drive.google.com/drive/folders/1lPV4GZzo9OuoshskpO334TIATc2zuwXS?usp=drive_link

model: gpt-realtime

Any help or advice would be greatly appreciated.

Thank you for the captures. I listened to them and inspected the audio data; somehow silence is getting mixed into the audio output, leading to alternating periods of speech and silence, which is what you are hearing. Next steps:

  • can you verify that this is still occurring as of today?
  • if so, can you open a new issue, since the root cause is different from this one?
  • can you verify that the audio isn’t going through any gateways that might be mangling the audio somehow?
  • can you post some call IDs from the affected calls?

Dear @juberti

Thank you so much for your reply on this. Below are answers to your questions:

  1. You asked whether it is still occurring as of today. Yesterday, when I posted, it was happening. Today I made 10 consecutive calls and all audio is crystal clear. So strange. Same build, same devices, same network.
  2. Agreed. We will follow your advice and open a new topic.
  3. The audio goes from OpenAI straight to the endpoint that made the capture, via WebRTC. The stream you played is direct from the OpenAI Realtime API (after SRTP decryption, of course, otherwise we would not be able to play it).
  4. We will generate dumps and use the OpenAI session ID as the filename.

Thank you so much.

It’s possible this resulted from a brief misconfig on our side earlier this week. If it’s no longer reproducing, I suspect that this was the root cause.