I’ve been testing the new Realtime API with SIP integration over Twilio, and the realtime conversation part works just fine. However, I haven’t figured out how to get a transcription of the audio input. I can get the full transcription of the model’s response, but I’m unable to retrieve the transcription of the user’s speech.
This is the only event related to the transcription that I receive:
I’ve found that I have to specify session.type as realtime whenever I want to send a session.update event. I’ve also found that updating certain properties, like tools, results in audio completely breaking and turning into static, but I can fix that by re-specifying the audio input and output format.
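For context, here’s roughly what that workaround looks like for me, as a sketch assuming the GA session shape and an already-open WebSocket ws (the audio/pcmu format name for G.711 μ-law is my assumption; double-check it against the current docs):

const update = {
  type: 'session.update',
  session: {
    type: 'realtime', // required on every GA session.update
    tools: [], // ...your tool definitions here...
    audio: {
      // re-specify both formats so updating tools doesn't break audio into static
      input: { format: { type: 'audio/pcmu' } },
      output: { format: { type: 'audio/pcmu' } }
    }
  }
};
ws.send(JSON.stringify(update));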
You’ll need to specify type if you don’t use the openai-beta: realtime=v1 header. The GA (non-beta) session format is also different from the beta format; transcription is now specified in a different place.
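Roughly, the two shapes differ like this (a sketch; the field paths are taken from this thread, everything else is illustrative):

// beta shape: transcription at the top level of the session
// (works with the openai-beta: realtime=v1 header)
const betaSession = {
  input_audio_transcription: { model: 'whisper-1' }
};

// GA shape: type is required and transcription moves under audio.input
const gaSession = {
  type: 'realtime',
  audio: {
    input: {
      transcription: { model: 'whisper-1' }
    }
  }
};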
Hi, any chance you could point to where in the docs we can find the current syntax for enabling transcription of the input audio on the realtime SIP endpoint, please? I’d also like to get this working.
Received from WebSocket: {"type":"conversation.item.input_audio_transcription.failed","event_id":"event_CEh5kSVR7ytnWX8KFvea5","item_id":"item_CEh5iYIXABFS0JpboGtpV","content_index":0,"error":{"type":"server_error","code":null,"message":"Input transcription failed for item 'item_CEh5iYIXABFS0JpboGtpV'. 403 Forbidden","param":null}}
I have tried the 'whisper-1' and 'gpt-4o-transcribe' models. This is the accept payload I am using:
call_accept = {
    "type": "realtime",
    "instructions": "Your name is Janet. You are a helpful assistant",
I’m implementing SIP calls with the OpenAI Realtime API and experiencing an issue with audio transcription configuration.
Working setup:
Model: gpt-4o-mini-realtime-preview-2024-12-17
SIP calls work perfectly with basic config
The AI assistant responds normally
Problem:
When I add input_audio_transcription to the call accept configuration, calls hang up immediately after being accepted (status 200).
Configuration that causes hangup:
{
  "type": "realtime",
  "model": "gpt-4o-mini-realtime-preview-2024-12-17",
  "instructions": "You are Jessica...",
  "input_audio_transcription": {
    "model": "whisper-1",
    "language": "es"
  }
}
Questions:
1. Does gpt-4o-mini-realtime-preview-2024-12-17 support input_audio_transcription for SIP calls?
2. Should I use gpt-4o-transcribe models instead for SIP transcription?
3. Are there specific headers required for transcription in SIP calls?
4. Any known limitations with transcription + SIP integration?
The same configuration works in the documentation examples, but causes immediate hangups in SIP calls.
Any guidance, or any other way to make transcription work, would be greatly appreciated! Thanks.
You’re using the beta session format without the beta header. You’ll either need to specify the beta header or update to the GA session format (audio → input → transcription).
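If you keep the beta shape, a minimal connection sketch with the header, using the Node 'ws' client (the URL and call_id query parameter are my assumptions for the SIP flow; verify both against the current docs):

import WebSocket from 'ws';

const callId = process.env.CALL_ID; // hypothetical: taken from the incoming-call webhook

const ws = new WebSocket(`wss://api.openai.com/v1/realtime?call_id=${callId}`, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    // opts back in to the beta session shape, per the reply above
    'openai-beta': 'realtime=v1'
  }
});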
I needed to implement real-time voice transcription for SIP calls using OpenAI’s Realtime API. The main challenge was getting caller speech transcribed and saved properly while maintaining a stable SIP connection for voice calls.
The Key Breakthrough: Session Configuration
The most critical aspect was getting the session configuration right. I discovered that OpenAI’s SIP integration requires very specific session parameters:
const sessionConfig = {
  type: 'session.update',
  session: {
    type: 'realtime', // CRITICAL: Must specify session type
    audio: {
      input: {
        // Format must match actual audio received (G.711 μ-law from SIP)
        transcription: {
          model: 'whisper-1' // GA format, not beta
        }
      }
    }
  }
};
The Multi-Layered Solution
Since OpenAI’s realtime transcription wasn’t reliably working with SIP, I implemented a fallback transcription system:
Primary: OpenAI Realtime API transcription via WebSocket events
Fallback: Capture raw G.711 μ-law audio → convert to PCM → upsample to 24 kHz → send to OpenAI Whisper API (see the sketch after this list)
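A minimal sketch of that fallback decode/upsample step, assuming raw μ-law bytes from the SIP leg (linear interpolation is good enough for speech-to-text; a production resampler would low-pass filter):

// G.711 μ-law byte -> 16-bit linear PCM (standard CCITT decode)
function muLawToPcm16(byte) {
  const u = ~byte & 0xff;            // μ-law stores the complement
  let t = ((u & 0x0f) << 3) + 0x84;  // mantissa plus bias
  t <<= (u & 0x70) >> 4;             // shift by the exponent segment
  return (u & 0x80) ? (0x84 - t) : (t - 0x84);
}

// 8 kHz μ-law -> 24 kHz PCM16 via naive 3x linear interpolation
function decodeAndUpsample(muLawBytes) {
  const pcm8k = Int16Array.from(muLawBytes, muLawToPcm16);
  const pcm24k = new Int16Array(pcm8k.length * 3);
  for (let i = 0; i < pcm8k.length; i++) {
    const a = pcm8k[i];
    const b = i + 1 < pcm8k.length ? pcm8k[i + 1] : a;
    pcm24k[i * 3] = a;
    pcm24k[i * 3 + 1] = Math.round(a + (b - a) / 3);
    pcm24k[i * 3 + 2] = Math.round(a + (2 * (b - a)) / 3);
  }
  return pcm24k; // wrap in a WAV header before posting to the Whisper API
}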
Technical Implementation
SIP Integration: Used OpenAI’s /v1/realtime/calls/{callId}/accept endpoint
Audio Format Handling: G.711 μ-law (8kHz) from SIP → PCM (24kHz) for Whisper
Event Deduplication: Implemented a processedItemIds Set to prevent duplicate transcripts (sketched after this list)
Noise Filtering: Added logic to filter out test patterns and meaningless audio
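The deduplication in particular can be very small; a sketch, where saveTranscript is a hypothetical stand-in for whatever persistence you use (Firebase in my case):

const processedItemIds = new Set();

function handleTranscriptEvent(event) {
  // transcripts can arrive via both the realtime events and the fallback path;
  // keep only the first transcript seen per conversation item
  if (processedItemIds.has(event.item_id)) return;
  processedItemIds.add(event.item_id);
  saveTranscript(event.item_id, event.transcript); // hypothetical persistence helper
}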
Session Update Timing
The timing of session updates proved crucial:
Send session.update immediately on WebSocket connection (see the sketch after this list)
Include session.type: 'realtime' (required for SIP)
Use GA format (audio.input.transcription) not beta format
Match audio formats between session config and actual SIP audio
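Put together, the connect-time flow looks roughly like this (sessionConfig is the object from the block above; the completed event type mirrors the failed event quoted earlier in the thread):

ws.on('open', () => {
  // configure the session before any audio flows
  ws.send(JSON.stringify(sessionConfig));
});

ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    handleTranscriptEvent(event); // dedupe + save, as above
  }
});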
The Result
I now have a working system that:
Maintains stable SIP voice connections
Captures caller transcripts through multiple pathways
Saves clean, deduplicated conversation logs to Firebase
Handles audio format mismatches gracefully
Provides real-time transcription for call screening
The key lesson: session configuration is everything when working with OpenAI’s SIP integration. Getting the format, timing, and parameters exactly right made the difference between a broken system and a working transcription pipeline.
The only downside is being charged for the transcription… the cost is minimal, though.