[BUG?] SIP Realtime – distorted or missing audio (noise/static) with gpt-realtime and gpt-realtime-mini
Over the past few days, SIP Realtime calls using either gpt-realtime-mini or gpt-realtime have started producing either no assistant audio at all or distorted, static-like audio.
I’m not sure exactly when this began, but the same setup was working perfectly a few weeks ago with the exact same code and configuration.
The model clearly generates speech internally — logs show response.output_audio_transcript.delta and response.output_audio.done — but the RTP audio that arrives over SIP is just noise or silence.
The issue is identical with both the PCMU and PCMA codecs, and with different voices and ASR models.
Environment
- Date / Region: October 2025 – South America (Brazil)
- Models: gpt-realtime-mini, gpt-realtime
- Integration: SIP Realtime (POST /v1/realtime/calls/{id}/accept + WebSocket)
- Symptom: The agent generates speech internally, but the RTP stream carries only static or silence.
Steps to Reproduce
- Create a SIP Realtime call and send a basic call.accept payload.
- Open the WebSocket and send a session.update with one of the payloads below.
- The assistant responds normally (transcripts appear), but the SIP side receives either no audio or continuous noise (a minimal sketch of this flow follows the list).
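Rough Python sketch of the first two steps plus the event check, simplified. The accept endpoint is the documented one; the WebSocket URL shown here (wss://api.openai.com/v1/realtime?call_id=...) and the requests/websockets usage are just how my client happens to do it, so treat them as illustrative rather than authoritative.

import asyncio
import json
import os

import requests
import websockets

API_KEY = os.environ["OPENAI_API_KEY"]
CALL_ID = "rtc_..."  # call ID from the SIP webhook (placeholder)

ACCEPT_PAYLOAD = {
    "type": "realtime",
    "model": "gpt-realtime-mini",
    "output_modalities": ["audio"],
}

SESSION_UPDATE = {}  # one of the session.update payloads shown below

def accept_call() -> None:
    # Step 1: POST /v1/realtime/calls/{id}/accept with the call.accept payload
    resp = requests.post(
        f"https://api.openai.com/v1/realtime/calls/{CALL_ID}/accept",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=ACCEPT_PAYLOAD,
        timeout=10,
    )
    resp.raise_for_status()

async def configure_session() -> None:
    # Step 2: open the WebSocket for this call, send session.update,
    # then print the next few event types (session.updated, errors, etc.)
    url = f"wss://api.openai.com/v1/realtime?call_id={CALL_ID}"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # note: this kwarg is named extra_headers on older websockets releases
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps(SESSION_UPDATE))
        for _ in range(5):
            event = json.loads(await ws.recv())
            print(event.get("type"))

if __name__ == "__main__":
    accept_call()
    asyncio.run(configure_session())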
Expected Behavior
Assistant audio should arrive clearly over RTP, matching the generated transcript.
Actual Behavior
The RTP packets are sent and received normally — traffic flows in both directions with a payload size around 170 bytes — but the decoded audio is just static or silence.
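For context, a 12-byte RTP header plus a 160-byte G.711 payload is 172 bytes, so packets of around 170 bytes are consistent with 20 ms frames. To check whether the payloads themselves carry speech, a helper along these lines can decode captured PCMU payloads and print their energy (sketch only; the µ-law decode is hand-rolled so it does not depend on the deprecated audioop module):

import math

def ulaw_to_linear(byte: int) -> int:
    # Decode one G.711 mu-law byte to a 16-bit linear PCM sample
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def frame_rms(payload: bytes) -> float:
    # RMS of one RTP payload (160 bytes = 20 ms at 8 kHz for PCMU)
    samples = [ulaw_to_linear(b) for b in payload]
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# Usage: feed in the 160-byte payloads extracted from a capture of the call.
# Speech frames swing between quiet and loud RMS values; a long run of nearly
# identical values points to silence or raw noise rather than speech.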
Minimal Payloads (redacted)
call.accept
{
  "type": "realtime",
  "model": "gpt-realtime-mini",
  "output_modalities": ["audio"]
}
session.update – PCMA + gpt-4o-mini-transcribe
{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "audio": {
      "output": { "voice": "alloy", "format": { "type": "audio/pcma" } },
      "input": {
        "format": { "type": "audio/pcma" },
        "turn_detection": { "type": "semantic_vad", "eagerness": "low" },
        "transcription": {
          "language": "pt",
          "model": "gpt-4o-mini-transcribe",
          "prompt": ""
        }
      }
    },
    "max_output_tokens": 4096
  }
}
session.update – PCMU + whisper-1
{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "model": "gpt-realtime-mini",
    "audio": {
      "output": { "voice": "ash", "format": { "type": "audio/pcmu" } },
      "input": {
        "format": { "type": "audio/pcmu" },
        "turn_detection": { "type": "semantic_vad", "eagerness": "low" },
        "transcription": {
          "language": "pt",
          "model": "whisper-1",
          "prompt": ""
        }
      }
    },
    "max_output_tokens": 4096
  }
}
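The two payloads differ only in codec, voice, transcription model, and the explicit "model" field in the PCMU variant. A parametrized sketch of how they can be built, just as a convenience for testing both codecs (not an official schema):

import json
from typing import Optional

def build_session_update(codec: str, voice: str, asr_model: str,
                         model: Optional[str] = None) -> str:
    # codec is "audio/pcmu" or "audio/pcma"; model is optional
    # ("gpt-realtime-mini" in the PCMU example above)
    session = {
        "type": "realtime",
        "audio": {
            "output": {"voice": voice, "format": {"type": codec}},
            "input": {
                "format": {"type": codec},
                "turn_detection": {"type": "semantic_vad", "eagerness": "low"},
                "transcription": {"language": "pt", "model": asr_model, "prompt": ""},
            },
        },
        "max_output_tokens": 4096,
    }
    if model:
        session["model"] = model
    return json.dumps({"type": "session.update", "session": session})

# e.g. build_session_update("audio/pcmu", "ash", "whisper-1", "gpt-realtime-mini")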
Logs (summary)
- The assistant generates responses normally — response.output_audio_transcript.delta and response.output_audio.done appear.
- SIP receives RTP packets (around 170 bytes each) going to and from OpenAI, but the audio is static or silent.
- Example call ID: rtc_0772b78b39be4de4…
- accept → 200 OK, session.update → applied successfully.
Tests Already Done
- Switched between PCMU and PCMA (same result).
- Tested different voices (alloy, ash).
- Tested different ASR models (gpt-4o-mini-transcribe, whisper-1).
- Verified ptime = 20 ms, no transcoding, pure µ-law/A-law RTP path (an RTP header check sketch follows this list).
- Behavior is intermittent but happens often.
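For the ptime/codec check, a minimal RTP header parse over captured packets is enough (a sketch that ignores header extensions and padding; payload type 0 is PCMU, 8 is PCMA, and 160 payload bytes of G.711 at 8 kHz corresponds to ptime = 20 ms):

import struct

def inspect_rtp(packet: bytes) -> dict:
    # Parse the fixed 12-byte RTP header (RFC 3550) of one UDP payload
    if len(packet) < 12:
        raise ValueError("too short for an RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F
    header_len = 12 + 4 * csrc_count  # extensions/padding not handled here
    return {
        "version": b0 >> 6,
        "payload_type": b1 & 0x7F,  # 0 = PCMU, 8 = PCMA
        "sequence": seq,
        "timestamp": ts,
        "ssrc": ssrc,
        "payload_bytes": len(packet) - header_len,  # 160 => 20 ms of G.711
    }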
Request
If there have been any recent changes to the SIP Realtime backend, could you please let me know?
The same configuration was working fine not long ago, and nothing has changed on my side.
I would really appreciate any suggestions or workarounds (like codec or ptime configuration) that might help restore normal audio output.
Thanks.