I’ve been testing the new Realtime API with SIP integration over Twilio, and the realtime conversation part works just fine. However, I haven’t figured out how to get a transcription of the audio input. I can get the full transcription of the model’s response, but I’m unable to retrieve the transcription of the user’s speech.
This is the only event related to the transcription that I receive:
I’ve found that I have to specify session.type as realtime whenever I want to send a session.update event. I’ve also found that updating certain properties, like tools, results in audio completely breaking and turning into static, but I can fix that by re-specifying the audio input and output format.
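For context, here’s roughly what that workaround looks like for me, as a sketch assuming the GA session shape and an already-open WebSocket ws (the audio/pcmu format name for G.711 μ-law is my assumption; double-check it against the current docs):

const update = {
  type: 'session.update',
  session: {
    type: 'realtime', // required on every GA session.update
    tools: [], // ...your tool definitions here...
    audio: {
      // re-specify both formats so updating tools doesn't break audio into static
      input: { format: { type: 'audio/pcmu' } },
      output: { format: { type: 'audio/pcmu' } }
    }
  }
};
ws.send(JSON.stringify(update));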
You’ll need to specify type if you don’t use the openai-beta: realtime=v1 header. The GA (non-beta) session format is also different from the beta format; transcription is now specified in a different place.
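Roughly, the two shapes differ like this (a sketch; the field paths are taken from this thread, everything else is illustrative):

// beta shape: transcription at the top level of the session
// (works with the openai-beta: realtime=v1 header)
const betaSession = {
  input_audio_transcription: { model: 'whisper-1' }
};

// GA shape: type is required and transcription moves under audio.input
const gaSession = {
  type: 'realtime',
  audio: {
    input: {
      transcription: { model: 'whisper-1' }
    }
  }
};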
Hi, any chance you could point to where in the docs we can find the current syntax for enabling transcription of the input audio on the realtime SIP endpoint, please? I’d also like to get this working.
Received from WebSocket: {"type":"conversation.item.input_audio_transcription.failed","event_id":"event_CEh5kSVR7ytnWX8KFvea5","item_id":"item_CEh5iYIXABFS0JpboGtpV","content_index":0,"error":{"type":"server_error","code":null,"message":"Input transcription failed for item 'item_CEh5iYIXABFS0JpboGtpV'. 403 Forbidden","param":null}}
I have tried the 'whisper-1' and 'gpt-4o-transcribe' models. This is the accept payload I am using:
call_accept = {
    "type": "realtime",
    "instructions": "Your name is Janet. You are a helpful assistant",
I’m implementing SIP calls with the OpenAI Realtime API and experiencing an issue with audio transcription configuration.
Working setup:
Model: gpt-4o-mini-realtime-preview-2024-12-17
SIP calls work perfectly with basic config
The AI assistant responds normally
Problem:
When I add input_audio_transcription to the call accept configuration, calls hang up immediately after being accepted (status 200).
Configuration that causes hangup:
{
  "type": "realtime",
  "model": "gpt-4o-mini-realtime-preview-2024-12-17",
  "instructions": "You are Jessica...",
  "input_audio_transcription": {
    "model": "whisper-1",
    "language": "es"
  }
}
Questions:
1. Does gpt-4o-mini-realtime-preview-2024-12-17 support input_audio_transcription for SIP calls?
2. Should I use gpt-4o-transcribe models instead for SIP transcription?
3. Are there specific headers required for transcription in SIP calls?
4. Any known limitations with transcription + SIP integration?
The same configuration works in the documentation examples, but causes immediate hangups in SIP calls.
Any guidance, or any other way to make transcription work, would be greatly appreciated! Thanks.
You’re using the beta session format without the beta header. You’ll either need to specify the beta header or update to the GA session format (audio → input → transcription).
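If you keep the beta shape, a minimal connection sketch with the header, using the Node 'ws' client (the URL and call_id query parameter are my assumptions for the SIP flow; verify both against the current docs):

import WebSocket from 'ws';

const callId = process.env.CALL_ID; // hypothetical: taken from the incoming-call webhook

const ws = new WebSocket(`wss://api.openai.com/v1/realtime?call_id=${callId}`, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    // opts back in to the beta session shape, per the reply above
    'openai-beta': 'realtime=v1'
  }
});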
I needed to implement real-time voice transcription for SIP calls using OpenAI’s Realtime API. The main challenge was getting caller speech transcribed and saved properly while maintaining a stable SIP connection for voice calls.
The Key Breakthrough: Session Configuration
The most critical aspect was getting the session configuration right. I discovered that OpenAI’s SIP integration requires very specific session parameters:
const sessionConfig = {
  type: 'session.update',
  session: {
    type: 'realtime', // CRITICAL: Must specify session type
    audio: {
      input: {
        // Format must match actual audio received (G.711 μ-law from SIP)
        transcription: {
          model: 'whisper-1' // GA format, not beta
        }
      }
    }
  }
};
The Multi-Layered Solution
Since OpenAI’s realtime transcription wasn’t reliably working with SIP, I implemented a fallback transcription system:
Primary: OpenAI Realtime API transcription via WebSocket events
Fallback: Capture raw G.711 μ-law audio → convert to PCM → upsample to 24 kHz → send to OpenAI Whisper API (see the sketch after this list)
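A minimal sketch of that fallback decode/upsample step, assuming raw μ-law bytes from the SIP leg (linear interpolation is good enough for speech-to-text; a production resampler would low-pass filter):

// G.711 μ-law byte -> 16-bit linear PCM (standard CCITT decode)
function muLawToPcm16(byte) {
  const u = ~byte & 0xff;            // μ-law stores the complement
  let t = ((u & 0x0f) << 3) + 0x84;  // mantissa plus bias
  t <<= (u & 0x70) >> 4;             // shift by the exponent segment
  return (u & 0x80) ? (0x84 - t) : (t - 0x84);
}

// 8 kHz μ-law -> 24 kHz PCM16 via naive 3x linear interpolation
function decodeAndUpsample(muLawBytes) {
  const pcm8k = Int16Array.from(muLawBytes, muLawToPcm16);
  const pcm24k = new Int16Array(pcm8k.length * 3);
  for (let i = 0; i < pcm8k.length; i++) {
    const a = pcm8k[i];
    const b = i + 1 < pcm8k.length ? pcm8k[i + 1] : a;
    pcm24k[i * 3] = a;
    pcm24k[i * 3 + 1] = Math.round(a + (b - a) / 3);
    pcm24k[i * 3 + 2] = Math.round(a + (2 * (b - a)) / 3);
  }
  return pcm24k; // wrap in a WAV header before posting to the Whisper API
}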
Technical Implementation
SIP Integration: Used OpenAI’s /v1/realtime/calls/{callId}/accept endpoint
Audio Format Handling: G.711 μ-law (8kHz) from SIP → PCM (24kHz) for Whisper
Event Deduplication: Implemented a processedItemIds Set to prevent duplicate transcripts (sketched after this list)
Noise Filtering: Added logic to filter out test patterns and meaningless audio
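The deduplication in particular can be very small; a sketch, where saveTranscript is a hypothetical stand-in for whatever persistence you use (Firebase in my case):

const processedItemIds = new Set();

function handleTranscriptEvent(event) {
  // transcripts can arrive via both the realtime events and the fallback path;
  // keep only the first transcript seen per conversation item
  if (processedItemIds.has(event.item_id)) return;
  processedItemIds.add(event.item_id);
  saveTranscript(event.item_id, event.transcript); // hypothetical persistence helper
}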
Session Update Timing
The timing of session updates proved crucial:
Send session.update immediately on WebSocket connection (see the sketch after this list)
Include session.type: 'realtime' (required for SIP)
Use GA format (audio.input.transcription) not beta format
Match audio formats between session config and actual SIP audio
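Put together, the connect-time flow looks roughly like this (sessionConfig is the object from the block above; the completed event type mirrors the failed event quoted earlier in the thread):

ws.on('open', () => {
  // configure the session before any audio flows
  ws.send(JSON.stringify(sessionConfig));
});

ws.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'conversation.item.input_audio_transcription.completed') {
    handleTranscriptEvent(event); // dedupe + save, as above
  }
});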
The Result
I now have a working system that:
Maintains stable SIP voice connections
Captures caller transcripts through multiple pathways
Saves clean, deduplicated conversation logs to Firebase
Handles audio format mismatches gracefully
Provides real-time transcription for call screening
The key lesson: session configuration is everything when working with OpenAI’s SIP integration. Getting the format, timing, and parameters exactly right made the difference between a broken system and a working transcription pipeline.
The only downside is being charged for the transcription… the cost is minimal, though.