Problem Description
When accepting SIP calls via /v1/realtime/calls/{CallId}/accept or sending session.update events on the call’s WebSocket connection, including certain properties (such as tools) unexpectedly causes audio to degrade to static. This occurs even when these properties are unrelated to audio configuration.
Steps to Reproduce
Working Configuration
Accepting SIP calls with this minimal configuration works correctly:
const acceptResponse = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
};
Failing Configuration
Adding a tools array causes audio static:
const acceptResponse = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  tools: [
    {
      type: "function",
      name: "end_call",
      description: "Ends the call upon the completion of the current response. This should only be called after the agent has said bye to the caller.",
      parameters: {
        type: "object",
        properties: {},
        required: [],
      },
    },
  ],
};
Alternative Reproduction Method
The same issue occurs when sending session.update events:
- Accept call with minimal configuration (audio works fine)
- Send session.update event with session.tools property
- Audio immediately degrades to static
Side note: When updating the session, it seems like you also have to specify session.type as realtime or the update is ignored.
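For reference, here is a minimal sketch of the session.update event used in the alternative reproduction. The helper name buildToolsUpdate and the ws variable are hypothetical and only there for illustration; the event shape follows the description above, including session.type set to realtime, since the update seemed to be ignored without it.

```javascript
// Hypothetical helper (not part of any SDK): builds a session.update
// client event that only sets the tools property.
function buildToolsUpdate(tools) {
  return {
    type: "session.update",
    session: {
      // Included because the update appeared to be ignored without it.
      type: "realtime",
      tools,
    },
  };
}

const endCallTool = {
  type: "function",
  name: "end_call",
  description: "Ends the call upon the completion of the current response.",
  parameters: { type: "object", properties: {}, required: [] },
};

const toolsUpdateEvent = buildToolsUpdate([endCallTool]);

// `ws` is assumed to be the already-open WebSocket for the call:
// ws.send(JSON.stringify(toolsUpdateEvent));
console.log(JSON.stringify(toolsUpdateEvent, null, 2));
```

Sending this event right after accepting the call with the minimal configuration is what triggers the static in my case.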
Current Workaround
Interestingly, explicitly including audio configuration prevents the static issue.
I don’t love this though, since I’m unsure whether the following is the ideal audio configuration. I’m also unsure whether anything else from audio needs to be re-specified, such as my preferred turn_detection.
const acceptResponse = {
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  tools: [ /* ...tool definitions... */ ],
  audio: {
    input: {
      format: { type: "audio/pcmu" },
      turn_detection: { type: "server_vad" },
    },
    output: {
      format: { type: "audio/pcmu" },
    },
  },
};
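To avoid forgetting the workaround in one place, the audio block can be attached through a small wrapper. This is only a sketch: withExplicitAudio and PHONE_AUDIO are hypothetical names, and the values are simply the ones from the workaround above, which may or may not be the ideal audio configuration.

```javascript
// Audio settings copied from the workaround above. Whether these are the
// right defaults for SIP calls is an open question, not a confirmed fact.
const PHONE_AUDIO = {
  input: {
    format: { type: "audio/pcmu" },
    turn_detection: { type: "server_vad" },
  },
  output: {
    format: { type: "audio/pcmu" },
  },
};

// Hypothetical helper: returns a copy of the session config with an explicit
// audio block, unless the caller already provided one.
function withExplicitAudio(session, audio = PHONE_AUDIO) {
  // Deep-copy the default so callers cannot mutate the shared constant.
  const audioCopy = JSON.parse(JSON.stringify(audio));
  return { ...session, audio: session.audio ?? audioCopy };
}

const acceptBody = withExplicitAudio({
  type: "realtime",
  model: "gpt-realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  tools: [],
});

console.log(acceptBody.audio.output.format.type); // prints "audio/pcmu"
```

The same wrapper could be applied to the session object of a session.update event, assuming the static there has the same cause.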
Expected Behavior
My main concern is around the unclear behavior and format of the /v1/realtime/calls/{CallId}/accept endpoint and the session.update client event.
For the accept endpoint, is there a reason audio only has to be included if I specify certain properties like tools? Why do I not need to include it when using a simpler prompt or instructions configuration?
For session.update client events, are those considered patches, or complete replacements? If they’re complete replacements, it seems odd that I could accidentally lose configuration that I didn’t specify when accepting the call to begin with. My initial instinct was that session.update was actually just applying patches though.
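To make the patch versus replacement question concrete, here is a small sketch of the two semantics applied to plain objects. It does not claim to show what the API actually does; applyAsPatch, applyAsReplacement, and the serverSession state are all hypothetical and exist only to illustrate the difference I'm asking about.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Patch semantics: keys in the update override, everything else is kept.
// Nested objects are merged recursively; arrays are replaced wholesale.
function applyAsPatch(current, update) {
  const result = { ...current };
  for (const [key, value] of Object.entries(update)) {
    result[key] =
      isPlainObject(value) && isPlainObject(current[key])
        ? applyAsPatch(current[key], value)
        : value;
  }
  return result;
}

// Replacement semantics: anything not in the update is dropped.
function applyAsReplacement(_current, update) {
  return { ...update };
}

// Hypothetical session state, as if audio defaults had been filled in
// when the call was accepted.
const serverSession = {
  type: "realtime",
  instructions: "You're a friendly bot talking to someone on a phone call.",
  audio: { output: { format: { type: "audio/pcmu" } } },
};

const toolsOnlyUpdate = {
  type: "realtime",
  tools: [{ type: "function", name: "end_call" }],
};

const patched = applyAsPatch(serverSession, toolsOnlyUpdate);
const replaced = applyAsReplacement(serverSession, toolsOnlyUpdate);

console.log("audio" in patched); // prints true
console.log("audio" in replaced); // prints false
```

Under replacement semantics, a tools-only update would silently drop the audio settings, which would be consistent with the static I'm hearing, though I can't confirm that is what happens.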
I imagine there’s room to improve this from both a documentation and a behavior point of view.