Asterisk + OpenAI Realtime SIP – call connects but no audio

Hi all,

I’m testing an integration between Asterisk and the OpenAI Realtime SIP API. The call connects successfully, but I don’t get any audio back from OpenAI.

Here’s the flow:

  1. Asterisk sends INVITE to OpenAI.

  2. OpenAI calls my Webhook.

  3. The webhook accepts the call by responding with the following JSON:

{
  "audio": {
    "input": {
      "format": "auto"
    },
    "output": {
      "format": "auto",
      "voice": "marin"
    }
  },
  "instructions": "say hello how are you I am your assistant.",
  "model": "gpt-realtime",
  "type": "realtime"
}

  4. OpenAI sends 200 OK – the session starts, but there is no audio.

  5. A few seconds later, the webhook ends the session.
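For context, here is a minimal sketch of the webhook side. It assumes the incoming-call webhook carries a call ID and that the call is accepted by POSTing the session config to an `/v1/realtime/calls/{call_id}/accept` endpoint, as I understand the OpenAI SIP docs; verify the exact event and endpoint names against the current reference:

```python
import json
import urllib.request

OPENAI_API_KEY = "sk-..."  # placeholder, use your real key

def build_accept_payload() -> dict:
    """Session config returned when accepting the incoming SIP call.

    Uses explicit codecs (audio/pcmu, i.e. G.711 u-law, common on
    Asterisk trunks) instead of "auto".
    """
    return {
        "type": "realtime",
        "model": "gpt-realtime",
        "instructions": "Say hello, how are you, I am your assistant.",
        "audio": {
            "input": {"format": "audio/pcmu"},
            "output": {"format": "audio/pcmu", "voice": "marin"},
        },
    }

def accept_call(call_id: str) -> None:
    # Assumed endpoint shape; check the current OpenAI SIP reference.
    url = f"https://api.openai.com/v1/realtime/calls/{call_id}/accept"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_accept_payload()).encode(),
        headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on a non-2xx response
```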

Has anyone here managed to establish a working Asterisk + OpenAI Realtime SIP connection with actual audio?

Any help or working example would be greatly appreciated!

Thanks!

I don’t think auto is a valid format. Use audio/pcm or audio/pcmu as appropriate.
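Applied to the first accept payload in this thread, only the audio block would change (a sketch, everything else stays the same):

```json
"audio": {
  "input": { "format": "audio/pcmu" },
  "output": { "format": "audio/pcmu", "voice": "marin" }
}
```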

Is there a valid JSON example anywhere? I can’t figure out from the webhook whether response.accept alone should be sent, or response.accept plus response.create.

Same problem here.
I tried with this one:
{
  "audio": {
    "input": {
      "format": "audio/pcmu"
    },
    "output": {
      "format": "audio/pcmu",
      "voice": "marin"
    }
  },
  "instructions": "You are a helpful IVR assistant. You can answer in English.\nGreet the caller warmly and stay on the line awaiting their response.\nDo not hang up or send a BYE unless the caller explicitly ends the conversation or the system instructs you to end it.\nGuide the caller with short, friendly questions and pause to listen after each response.\nKeep the call active and be patient while waiting for audio from the caller.",
  "model": "gpt-realtime",
  "response": {
    "conversation": [
      {
        "content": [
          {
            "text": "Hello, I'm your IVR assistant. How can I help?",
            "type": "output_text"
          },
          {
            "audio": {
              "format": "wav",
              "transcript": "Hello, I'm your IVR assistant. How can I help?",
              "voice": "marin"
            },
            "type": "output_audio"
          }
        ],
        "role": "assistant"
      }
    ],
    "modalities": [
      "audio"
    ]
  },
  "type": "realtime"
}

response.create is only needed if you want the model to generate a response without waiting for the user.
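As a sketch, sending response.create over the Realtime WebSocket to make the model speak first would look something like this (the greeting instructions are just an example; check the event shape against the current Realtime API reference):

```python
import json

def initial_greeting_event() -> str:
    """Build a response.create event that asks the model to produce a
    response immediately, without waiting for caller audio."""
    return json.dumps({
        "type": "response.create",
        "response": {
            "instructions": "Greet the caller and ask how you can help.",
        },
    })

# After the session is established on your WebSocket client:
# ws.send(initial_greeting_event())
```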

I couldn’t find a sample setup.

Codex gives the following recommendation:

“We are sending the correct accept request (200 OK, with save_session => true and PCM audio), but the response body is empty and the call is immediately terminated with BYE. It appears Realtime sessions are not enabled for this account, so no session_id is returned. Without a session_id we can’t open a WebSocket or stream audio back, so the model says the initial greeting and closes the call right away.

Could you check whether session persistence is enabled for our project? If possible, please allow the session to be saved (returning session_id in the accept response). Once we receive session_id, we can build a WebSocket client to keep the conversation going instead of closing immediately.”

Please help us.
Right now the conversation starts, and immediately afterwards we get a BYE from OpenAI.

Have you considered using Asterisk AudioSocket and integrating Asterisk with OpenAI’s Realtime WebSocket APIs?

That approach would let you stream audio directly from Asterisk to OpenAI over a persistent WebSocket connection, enabling low-latency bidirectional audio (STS) without adding unnecessary intermediaries. It can simplify the architecture and give you tighter control over session handling and media flow.
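If you go that route, the AudioSocket wire format is simple: each frame is a 1-byte type, a 2-byte big-endian payload length, then the payload; audio frames carry 16-bit signed linear PCM at 8 kHz mono. Here is a minimal parser sketch (frame type values as I recall them from the Asterisk AudioSocket docs; double-check against your Asterisk version):

```python
import struct

# AudioSocket frame types (per the Asterisk AudioSocket protocol)
KIND_HANGUP = 0x00
KIND_UUID = 0x01
KIND_AUDIO = 0x10
KIND_ERROR = 0xFF

def parse_frame(buf: bytes):
    """Split one AudioSocket frame off the front of buf.

    Returns (kind, payload, rest) or None if buf does not yet hold a
    complete frame. Wire format: 1-byte type, 2-byte big-endian payload
    length, payload bytes.
    """
    if len(buf) < 3:
        return None
    kind = buf[0]
    length = struct.unpack(">H", buf[1:3])[0]
    if len(buf) < 3 + length:
        return None
    return kind, buf[3:3 + length], buf[3 + length:]
```

You would read from the TCP socket into a buffer, call parse_frame in a loop, forward KIND_AUDIO payloads to the OpenAI WebSocket, and write the model's audio back as AudioSocket frames in the other direction.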

Curious to know if you’ve evaluated this setup or if there are specific constraints preventing you from using it.