Certain session properties result in the call turning to static with gpt-realtime and SIP

Problem Description

When accepting SIP calls via /v1/realtime/calls/{CallId}/accept or sending session.update events on the call’s WebSocket connection, including certain properties (such as tools) unexpectedly causes audio to degrade to static. This occurs even when these properties are unrelated to audio configuration.

Steps to Reproduce

Working Configuration

Accepting SIP calls with this minimal configuration works correctly:

const acceptResponse = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
};

Failing Configuration

Adding a tools array causes audio static:

const acceptResponse = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",

  tools: [
    {
      type: "function",
      name: "end_call",
      description: "Ends the call upon the completion of the current response. This should only be called after the agent has said bye to the caller.",
      parameters: {
        type: "object",
        properties: {},
        required: [],
      },
    },
  ]
};

Alternative Reproduction Method

The same issue occurs when sending session.update events:

  1. Accept call with minimal configuration (audio works fine)
  2. Send session.update event with session.tools property
  3. Audio immediately degrades to static

Side note: When updating the session, it seems like you also have to specify session.type as realtime or the update is ignored.
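The session.update reproduction above can be sketched as a small helper that builds the client event. This is a minimal sketch: buildSessionUpdate is a hypothetical helper name, and the event shape (a session object carrying type "realtime" plus tools) follows the description in this post rather than an authoritative schema.

```javascript
// Hypothetical helper: builds a session.update client event that adds tools.
// Per the side note above, session.type is included so the update is not ignored.
function buildSessionUpdate(tools) {
  return {
    type: "session.update",
    session: {
      type: "realtime",
      tools,
    },
  };
}

const endCallTool = {
  type: "function",
  name: "end_call",
  description: "Ends the call after the agent has said bye to the caller.",
  parameters: { type: "object", properties: {}, required: [] },
};

// Serialized form that would be sent over the call's WebSocket, e.g. ws.send(payload).
const payload = JSON.stringify(buildSessionUpdate([endCallTool]));
console.log(payload);
```

Keeping the event construction in one place makes it easier to compare the exact payload that triggers static against one that does not.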

Current Workaround

Interestingly, explicitly including audio configuration prevents the static issue.

I don’t love this though, since I’m unsure whether the following is the ideal audio configuration. I’m also unsure whether anything else from audio needs to be re-specified, such as my preferred turn_detection.

const acceptResponse = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  tools: [ /* ...tool definitions... */ ],

  audio: {
    input: {
      format: { type: "audio/pcmu" },
      turn_detection: { type: "server_vad" }
    },
    output: {
      format: { type: "audio/pcmu" }
    }
  },
};
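For anyone comparing payloads, here is a rough sketch of how the workaround configuration might be posted to the accept endpoint. buildAcceptRequest, the "call_123" call ID, and the "sk-example" key are hypothetical placeholders, and the Bearer Authorization header is an assumption on my part rather than something confirmed in this thread. The helper only prepares the request, so nothing is sent until fetch is called.

```javascript
// Hypothetical helper: prepares (but does not send) the POST to the accept endpoint.
// The Bearer Authorization header is an assumption, not something stated in this thread.
function buildAcceptRequest(callId, apiKey, session) {
  return {
    url: `https://api.openai.com/v1/realtime/calls/${encodeURIComponent(callId)}/accept`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(session),
    },
  };
}

const workaroundSession = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  tools: [],
  // Explicit G.711 mu-law formats, as in the workaround above.
  audio: {
    input: { format: { type: "audio/pcmu" }, turn_detection: { type: "server_vad" } },
    output: { format: { type: "audio/pcmu" } },
  },
};

// "call_123" and "sk-example" are placeholders for illustration only.
const acceptRequest = buildAcceptRequest("call_123", "sk-example", workaroundSession);
console.log(acceptRequest.url);
// To actually send it (Node 18+): await fetch(acceptRequest.url, acceptRequest.options);
```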

Expected Behavior

My main concern is around the unclear behavior and format of the /v1/realtime/calls/{CallId}/accept endpoint and the session.update client event.

For the accept endpoint, is there a reason audio only has to be included if I specify certain properties like tools? Why do I not need to include it when using a simpler prompt or instructions configuration?

For session.update client events, are those considered patches, or complete replacements? If they’re complete replacements, it seems odd that I could accidentally lose configuration that I didn’t specify when accepting the call to begin with. My initial instinct was that session.update was actually just applying patches though.

I imagine there’s room to improve this from both a documentation and a behavior point of view.

It looks like this may have been fixed between when I found the issue yesterday afternoon and now, although I haven’t tested very thoroughly to confirm. And in either case, documentation improvements on patch vs replace would be helpful.

Hi Ryan, excellent contribution.

I wanted to ask whether you’re using Twilio or direct SIP.

I’m trying to use SIP following the example at https://platform.openai.com/docs/guides/realtime-sip

When the request arrives at the webhook, I accept the call, but when I try to connect to the WebSocket, I always get “WebSocket error: server rejected WebSocket connection: HTTP 404”.

Thanks in advance.

I’m using a number registered through Twilio, but I’m using it through their Elastic SIP Trunking feature (instead of their Programmable Voice feature). So I don’t have to do anything with TwiML to accept the call, nor do I have to proxy audio between the two like was required before OpenAI had SIP endpoints.

I’m seeing the same thing: the payload in the accept request causes static audio via Twilio SIP. I was able to get it going with your samples. Thank you!

Feels buggy that static would be presented, even if the accept payload was outright wrong.

The static is caused by the audio formats being clobbered, followed by mishandling of PCMU vs. PCM data. (We shouldn’t let you make this error, but that’s what’s happening.)

To your other question, session.updates are patches, not replacements.
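To make "patch, not replacement" concrete, here is a small conceptual model. It is not the server's implementation: applySessionPatch is a hypothetical function, and whether nested objects merge recursively and arrays are replaced wholesale is an assumption made for illustration.

```javascript
// Conceptual model only: merges a patch into the current session config.
// Assumptions: nested objects merge recursively; arrays and scalars are replaced.
function applySessionPatch(current, patch) {
  const result = { ...current };
  for (const [key, value] of Object.entries(patch)) {
    const existing = result[key];
    const bothObjects =
      value !== null && typeof value === "object" && !Array.isArray(value) &&
      existing !== null && typeof existing === "object" && !Array.isArray(existing);
    result[key] = bothObjects ? applySessionPatch(existing, value) : value;
  }
  return result;
}

const currentSession = {
  type: "realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  audio: { output: { format: { type: "audio/pcmu" } } },
};

// A patch that only adds tools leaves instructions and audio untouched.
const updatedSession = applySessionPatch(currentSession, {
  type: "realtime",
  tools: [{ type: "function", name: "end_call" }],
});
console.log(updatedSession.audio.output.format.type); // audio/pcmu
console.log(updatedSession.instructions === currentSession.instructions); // true
```

Under this model, fields omitted from a session.update keep their previous values, which is the behavior the reply above describes.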

I’m having the exact same issue using Twilio’s Elastic SIP Trunking. However the workaround doesn’t work for me.

The payload (in Python format) that I’m sending to "https://api.openai.com/v1/realtime/calls/"+event.data.call_id+"/accept" is below and I still get static upon calling the SIP number.

call_accept = {
    "type": "realtime",
    "model": "gpt-realtime",
    "output_modalities": ["audio"],
    "audio": {
        "output": {
            "format": { "type": "audio/pcmu" },
        },
        "input": {
            "format": { "type": "audio/pcmu" },
            "turn_detection": { "type": "server_vad" }
        },
    },
    "tools": [
        {
            "type": "web_search",
        }
    ],
    "tool_choice": "auto",
    "instructions": "You are a support agent.",
}

No matter what format (or voice) is specified, the response.created event received on the websocket specifies "audio":{"output":{"format":{"type":"audio/pcm","rate":24000},"voice":"alloy"}

I have tried specifying input/output format types audio/pcma and audio/pcm (with "rate": 24000) but they all result in static.

This is quite easy to reproduce. You follow the guide here https://platform.openai.com/docs/guides/realtime-sip#handle-the-webhook and then simply add a tool to the call_accept payload.

I just deployed an update: if the initial /accept call is invalid, we now disconnect the call, so you will no longer get static on the call.

To see the error you can connect via WebSocket and it will be viewable over that.

I will work on preventing codecs from changing mid-call!

@jk7 Can you try sending that session.update via WebRTC or WebSocket and see if it works? A couple of things I noticed:

  • You don’t need to send your audio block at all.
  • I don’t believe that tool definition is right. You need description, parameters, name, and type. I bet if you connected to the WebSocket for your SIP session you would get some debug info!

Thanks for the details!

I’m still experiencing static when accepting calls with incorrect requests in some scenarios.

For example, I recently tried to restrict the rights of the API key I was using, but accidentally removed the rights to read the stored prompt by ID. After trying to accept a call just now using that prompt ID, I saw an error event in the WebSocket (which was actually very helpful), but the call wasn’t terminated and was replaced with static instead.

Here’s the error with an event_id if it’s helpful:

{
  "type": "error",
  "event_id": "event_CCrHPvWbJ23q91GKSQHGC",
  "error": {
    "type": "invalid_request_error",
    "code": null,
    "message": "Unauthorized to fetch prompt with id '{prompt_id}'.",
    "param": null,
    "event_id": null
  }
}

Thanks @Sean-Der, the update is working and validating the /accept call correctly. The WebSocket is reporting the error as an event:

Received from WebSocket: {"type":"error","event_id":"event_CD6r3NxYHO6GL3L9AWuNc","error":{"type":"invalid_request_error","code":"invalid_value","message":"Invalid value: 'web_search'. Supported values are: 'function' and 'mcp'.","param":"session.tools[0].type","event_id":"initial_session_update"}}
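In case it helps others debugging over the WebSocket, here is a rough sketch of pulling the useful parts out of an error event like the one above. describeRealtimeError is a hypothetical helper, and the field names (error.code, error.message, error.param) are simply those visible in the events quoted in this thread.

```javascript
// Hypothetical helper: turns a raw WebSocket message into a short log line
// if it is an error event, or returns null for any other event type.
function describeRealtimeError(raw) {
  const event = JSON.parse(raw);
  if (event.type !== "error" || !event.error) return null;
  const { code, message, param } = event.error;
  const where = param ? ` (at ${param})` : "";
  return `[${code ?? "no_code"}] ${message}${where}`;
}

// The error event quoted above, re-serialized for the example.
const rawErrorEvent = JSON.stringify({
  type: "error",
  event_id: "event_CD6r3NxYHO6GL3L9AWuNc",
  error: {
    type: "invalid_request_error",
    code: "invalid_value",
    message: "Invalid value: 'web_search'. Supported values are: 'function' and 'mcp'.",
    param: "session.tools[0].type",
    event_id: "initial_session_update",
  },
});

console.log(describeRealtimeError(rawErrorEvent));
```

The param field is the most useful part here, since it points straight at the offending property in the accept payload.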

I did not realize that tools such as web_search and file_search are not available in the realtime API. Thanks for the update - this will make the experience better when developing via the SIP interface.
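A simple pre-flight check along these lines would have caught my mistake before the call was accepted. This is only a sketch: findUnsupportedTools is a hypothetical helper, and the supported list is copied from the error message above, so it may not stay accurate as the API changes.

```javascript
// Supported tool types, taken from the error message quoted above.
const SUPPORTED_TOOL_TYPES = new Set(["function", "mcp"]);

// Hypothetical pre-flight check: returns one problem string per unsupported tool.
function findUnsupportedTools(tools) {
  return tools
    .map((tool, index) => ({ tool, index }))
    .filter(({ tool }) => !SUPPORTED_TOOL_TYPES.has(tool.type))
    .map(({ tool, index }) => `session.tools[${index}].type: '${tool.type}' is not supported`);
}

const toolProblems = findUnsupportedTools([
  { type: "web_search" },
  { type: "function", name: "end_call" },
]);
console.log(toolProblems);
```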

Hi guys, trying to set voice in the accept response also seems to cause audio to fail, is that a known issue? Or am I doing something wrong?

Setting voice in accept works, check out https://hello-realtime.val.run for an example.

Hi, thanks. Are you sure that’s using realtime SIP? Wouldn’t it be using WebRTC since it’s web browser based?

I can confirm I can change voices in SIP successfully.

@john.st If you click the “View source” link at the top left of @juberti's linked web app, you can see the README file that explains how to call the number via SIP, along with all of the source code.

Aha .. thank you. I didn’t see that ‘View Source’ link. Appreciate it.