How to set up transcription on the Realtime API with SIP

Hi all,

I’ve been testing the new Realtime API with SIP integration over Twilio, and the realtime conversation part works just fine. However, I haven’t figured out how to get the transcription of the audio input. I can get the full transcription of the model’s response, but I’m unable to retrieve the transcription of the user’s speech.

This is the only event related to the transcription that I receive:

{"id":"item_C9fNHe56u8NI1EJTiv4Q9","type":"message","status":"completed","role":"user","content":[{"type":"input_audio","transcript":null}]}}

I’ve tried sending the session.update event to set up the transcription model, but this doesn’t seem to work.

system_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "es",
            "prompt": "",
        },
    }
}

And I got the following error:

"error":{"type":"invalid_request_error","code":"missing_required_parameter","message":"Missing required parameter: 'session.type'.","param":"session.type","event_id":null}

I’ve been testing with the script provided in the documentation: https://platform.openai.com/docs/guides/realtime-sip

Does the Realtime API with SIP support transcription of the user’s audio? Or am I doing something wrong? Thanks in advance.

2 Likes

I’ve found that I have to specify session.type as realtime whenever I want to send a session.update event. I’ve also found that updating certain properties, like tools, results in the audio completely breaking and turning into static, but I can fix that by re-specifying the audio input and output formats.
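
For reference, here is a sketch of the kind of session.update I send now. The "audio/pcmu" format value is my assumption for SIP’s G.711 μ-law; match it to whatever audio you are actually receiving.

session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",  # required in the GA session format
        "tools": [],  # whatever tools you are updating
        "audio": {
            # re-specifying both formats keeps the audio from turning to static
            "input": {"format": {"type": "audio/pcmu"}},
            "output": {"format": {"type": "audio/pcmu"}},
        },
    },
}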

I just posted about that here:

you’ll need to specify type if you don’t use the openai-beta: realtime=v1 header. The GA (non-beta) session format is also different from the beta format; transcription is now specified in a different place.
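
to illustrate the difference (a sketch; the dict shapes follow the two formats discussed in this thread):

# Beta format (requires the "openai-beta: realtime=v1" header):
beta_session = {"input_audio_transcription": {"model": "whisper-1"}}

# GA format (no beta header; transcription moves under audio.input):
ga_session = {
    "type": "realtime",
    "audio": {"input": {"transcription": {"model": "whisper-1"}}},
}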

1 Like

Hi, any chance you could point to where in the docs we can find the current syntax to include transcription of the input audio on the realtime SIP endpoint, please? I would also like to get this working.

see the docs on the Session object, e.g., in session.update

I’ll also add this into hello-realtime sometime this week.

2 Likes

Where is transcription specified now? I do not see it in the docs anywhere. Thanks!

see the link above, or the hello-realtime code that configures this

This is the event I receive:

Received from WebSocket: {"type":"conversation.item.input_audio_transcription.failed","event_id":"event_CEh5kSVR7ytnWX8KFvea5","item_id":"item_CEh5iYIXABFS0JpboGtpV","content_index":0,"error":{"type":"server_error","code":null,"message":"Input transcription failed for item 'item_CEh5iYIXABFS0JpboGtpV'. 403 Forbidden","param":null}}

I have tried the ‘whisper-1’ and ‘gpt-4o-transcribe’ models. This is the accept payload I am using:

call_accept = {
    "type": "realtime",
    "instructions": "Your name is Janet. You are a helpful assistant",
    "model": "gpt-realtime-2025-08-28",
    "audio": {
        "input": {
            "transcription": {"model": "whisper-1"}
        },
        "output": {"voice": "alloy"}
    }
}

1 Like

ok interesting, we’ve seen this before. What billing tier are you on? https://platform.openai.com/settings/organization/limits

I am currently usage tier 1.

By the way, what can we pass to the /accept endpoint? All I could glean from the SIP tutorial for my Rust project is something like:

#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct AcceptCallRequest {
    /// This is *always* `realtime`. Convenience constructor exposed to ensure this.
    #[serde(rename = "type")]
    pub session_type: RealtimeCallSessionType,
    pub instructions: String,
    pub model: RealtimeModel,
}

But it seems like we can pass everything we could pass in a session.update client event once connected?

Yes, everything from session.update but also model.

1 Like
 "audio": {
            "input": {
                "transcription": {"model":"whisper-1"}
                },
            "output": {
                "voice": VOICE_NAME}
        },

but getting

Received Event: {
  "type": "conversation.item.done",
  "event_id": "event_CGdBIYGTuz8VzXGNcgpQX",
  "previous_item_id": "item_CGdBGi1yUL8Q4BlNnd4Eg",
  "item": {
    "id": "item_CGdBGBiPPt0yx0xkj5OdV",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}

@rpendleton @juberti Can anybody assist? I’m using SIP routing.

if you can post your session ID (sess_xxxx) we can take a closer look.

Hi team,

I’m implementing SIP calls with the OpenAI Realtime API and experiencing an issue with audio transcription configuration.

Working setup:

  • Model: gpt-4o-mini-realtime-preview-2024-12-17
  • SIP calls work perfectly with basic config
  • The AI assistant responds normally

Problem:
When I add input_audio_transcription to the call accept configuration, calls hang up immediately after being accepted (status 200).

Configuration that causes hangup:

{
  "type": "realtime",
  "model": "gpt-4o-mini-realtime-preview-2024-12-17",
  "instructions": "You are Jessica...",
  "input_audio_transcription": {
    "model": "whisper-1",
    "language": "es"
  }
}

Questions:
1. Does gpt-4o-mini-realtime-preview-2024-12-17 support input_audio_transcription for SIP calls?
2. Should I use gpt-4o-transcribe models instead for SIP transcription?
3. Are there specific headers required for transcription in SIP calls?
4. Any known limitations with transcription + SIP integration?



The same configuration works in the documentation examples, but causes immediate hangups in SIP calls.

Any guidance, or any other way to make transcription work, would be greatly appreciated! Thanks.

you’re using the beta session format without the beta header. you’ll either need to specify the beta header or update to the GA session format (audio → input → transcription)
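
for example, the failing config above would look something like this in the GA format (a sketch; keep your own model and instructions):

call_accept = {
    "type": "realtime",
    "model": "gpt-4o-mini-realtime-preview-2024-12-17",
    "instructions": "You are Jessica...",
    "audio": {
        # transcription moves from the top level to audio.input
        "input": {
            "transcription": {"model": "whisper-1", "language": "es"}
        }
    },
}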

1 Like

Thank you very much, juberti! The GA session format worked perfectly! :right_facing_fist: :left_facing_fist:

@juberti Would you mind confirming that this worked for you and allowed you to transcribe user input audio? Much appreciated!

SIP Transcription Solution:

The Challenge

I needed to implement real-time voice transcription for SIP calls using OpenAI’s Realtime API. The main challenge was getting caller speech transcribed and saved properly while maintaining a stable SIP connection for voice calls.

The Key Breakthrough: Session Configuration

The most critical aspect was getting the session configuration right. I discovered that OpenAI’s SIP integration requires very specific session parameters:

const sessionConfig = {
  type: 'session.update',
  session: {
    type: 'realtime', // CRITICAL: Must specify session type
    audio: {
      input: {
        // Format must match actual audio received (G.711 μ-law from SIP)
        transcription: {
          model: 'whisper-1' // GA format, not beta
        }
      }
    }
  }
};

The Multi-Layered Solution

Since OpenAI’s realtime transcription wasn’t reliably working with SIP, I implemented a fallback transcription system:

  1. Primary: OpenAI Realtime API transcription via WebSocket events
  2. Fallback: Capture raw G.711 μ-law audio → convert to PCM → upsample to 24kHz → send to OpenAI Whisper API (see the sketch after this list)
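
Here’s a minimal sketch of that fallback path in Python. It assumes the stdlib audioop module (deprecated in 3.11 and removed in 3.13, so pin accordingly or use a replacement) and the openai package; names like mulaw_bytes are illustrative.

import audioop
import io
import wave

from openai import OpenAI

def transcribe_mulaw(mulaw_bytes: bytes) -> str:
    # G.711 mu-law (8 kHz) -> 16-bit linear PCM
    pcm_8k = audioop.ulaw2lin(mulaw_bytes, 2)
    # Upsample 8 kHz -> 24 kHz (mono, 2-byte samples)
    pcm_24k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 24000, None)

    # Wrap the PCM in a WAV container so the API can detect the format
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(24000)
        wav.writeframes(pcm_24k)
    buf.seek(0)
    buf.name = "audio.wav"  # the client infers the format from the file name

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    result = client.audio.transcriptions.create(model="whisper-1", file=buf)
    return result.text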

Technical Implementation

  • SIP Integration: Used OpenAI’s /v1/realtime/calls/{callId}/accept endpoint
  • Audio Format Handling: G.711 μ-law (8kHz) from SIP → PCM (24kHz) for Whisper
  • Event Deduplication: Implemented processedItemIds Set to prevent duplicate transcripts
  • Noise Filtering: Added logic to filter out test patterns and meaningless audio

Session Update Timing

The timing of session updates proved crucial (a sketch follows this list):

  • Send session.update immediately on WebSocket connection
  • Include session.type: 'realtime' (required for SIP)
  • Use GA format (audio.input.transcription) not beta format
  • Match audio formats between session config and actual SIP audio
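
A sketch of those points together, assuming the websockets Python package (the header kwarg is additional_headers in websockets ≥ 13, extra_headers in older versions) and a session_update dict shaped like the ones earlier in this thread:

import asyncio
import json

import websockets

async def run(ws_url: str, headers: dict, session_update: dict) -> None:
    async with websockets.connect(ws_url, additional_headers=headers) as ws:
        # Send the session.update immediately on connection
        await ws.send(json.dumps(session_update))
        async for raw in ws:
            event = json.loads(raw)
            # The .failed variant of this event appeared earlier in the thread;
            # the .completed counterpart carries the finished transcript
            if event.get("type") == "conversation.item.input_audio_transcription.completed":
                print("caller:", event.get("transcript"))

# asyncio.run(run(url, {"Authorization": f"Bearer {api_key}"}, session_update))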

The Result

I now have a working system that:

  • Maintains stable SIP voice connections
  • Captures caller transcripts through multiple pathways
  • Saves clean, deduplicated conversation logs to Firebase
  • Handles audio format mismatches gracefully
  • Provides real-time transcription for call screening

The key lesson: session configuration is everything when working with OpenAI’s SIP integration. Getting the format, timing, and parameters exactly right made the difference between a broken system and a working transcription pipeline.

The only downside is being charged for the transcription, though the cost is minimal.

I hope this helps someone!

2 Likes

you should be able to send this to /accept, no session.update needed