How to set up transcription on Realtime API with SIP

Hi,

Can you describe better what was the issue with the transcription? Were you missing some text?

Do I understand correctly that you record the audio and send it to Whisper in case the transcription via Realtime-SIP fails?

What’s the audio config for? Is it used for the fallback, or does it just configure which model should be used for transcription with Realtime-SIP?

audio: {
  input: {
    // Format must match actual audio received (G.711 μ-law from SIP)
    transcription: {
      model: 'whisper-1' // GA format, not beta
    }
  }
}

Can you provide an example, please? The documentation does not show this. I’m only seeing the transcript of OpenAI’s output, not my caller’s voice input.

Currently with my setup, I’ll be spending credits for both transcription and gpt-realtime.

Yes, exactly. It acts as a fallback when no transcription is received. The format shown is for the session.update API call; that is the shape it expects. Specifically, the input, i.e. my caller speaking, is what should be transcribed. Separately, though, it appears OpenAI is transcribing the output from the AI voice agent instead.
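For reference, here is a sketch of the full session.update event that would carry this config over the Realtime WebSocket. The nesting follows the GA shape quoted above (`session.type: "realtime"`, `audio.input.transcription`), but treat the exact field names as an assumption to verify against the current Realtime API reference:

```python
import json

# Sketch: build the session.update event that enables input (caller)
# audio transcription. The "type": "realtime" and audio.input.transcription
# nesting follow the GA shape discussed above -- verify against the docs.
def build_session_update(model: str = "whisper-1") -> str:
    event = {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "audio": {
                "input": {
                    # Model used to transcribe the caller's input audio.
                    "transcription": {"model": model},
                },
            },
        },
    }
    return json.dumps(event)

print(build_session_update())
```

The returned string would be sent over the already-open Realtime WebSocket (e.g. `ws.send(build_session_update())`).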


I have a similar problem: it seems that transcription does not transcribe the caller’s voice on (in my case) incoming SIP calls.


Yes, Jubert! Please provide an example if you can 🙏. Thank you as always!

See agent.ts in hello-realtime | Val Town. Note that for some reason some folks are getting conversation.item.input_audio_transcription.failed errors back when configuring transcription; still looking into that.


If you are getting these .failed errors when setting up transcription, try this curl example to see if your API key is enabled for our Audio APIs.
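The curl example itself isn’t reproduced in this thread, but the same kind of check can be sketched against the public /v1/audio/transcriptions endpoint. The endpoint path and the whisper-1 model name come from the public Audio API; the audio file name here is a placeholder, and the request is built without being sent so the sketch runs offline:

```python
import os

# Sketch: build (but don't send) a request against the public Audio API
# transcription endpoint. Sending it with a real key and a short audio
# file should return a transcript if the key has Audio API access; an
# authorization error suggests it does not.
def build_audio_api_check(api_key: str, audio_path: str = "sample.wav") -> dict:
    return {
        "method": "POST",
        "url": "https://api.openai.com/v1/audio/transcriptions",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": {"model": "whisper-1"},
        "file": audio_path,  # placeholder; any short recording works
    }

req = build_audio_api_check(os.environ.get("OPENAI_API_KEY", "sk-placeholder"))
print(req["url"])
```

To actually send it, pass the pieces to an HTTP client, e.g. `httpx.post(req["url"], headers=req["headers"], data=req["data"], files={"file": open(req["file"], "rb")})`.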

I’m having this issue with the failed transcription events as well. The call itself is working perfectly. I accepted the call like this:

await oai_client.post(
    f"/realtime/calls/{call_id}/accept",
    body={
        "type": "realtime",
        "model": REALTIME_MODEL,
        "instructions": REALTIME_INSTRUCTIONS,
        "audio": {
            "input": {
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.65,
                },
                "transcription": {
                    "language": "en",
                    "model": "gpt-4o-transcribe-latest",
                },
            },
            "output": {
                "voice": "shimmer",
                "speed": 1.25,
            },
        },
    },
    cast_to=httpx.Response,
)

But I get failed events like:

{
  "type": "conversation.item.input_audio_transcription.failed",
  "event_id": "event_CMKd5yz496AvZGzB2Q49a",
  "item_id": "item_CMKd5yUAuaEkiFEWUp6sW",
  "content_index": 0,
  "error": {
    "type": "server_error",
    "code": null,
    "message": "Input transcription failed for item 'item_CMKd5yUAuaEkiFEWUp6sW'.",
    "param": null
  }
}
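To implement the fallback discussed earlier in the thread, one option is to watch for these events and record which items failed, so their recorded audio can be re-transcribed out of band. A minimal sketch, with the event and field names taken from the error event above:

```python
import json

# Sketch: collect item_ids whose input transcription failed, so their
# recorded audio can be re-transcribed out of band (e.g. via the batch
# /v1/audio/transcriptions endpoint) -- the fallback mentioned earlier.
def note_failed_transcription(raw_event: str, failed_items: list) -> None:
    event = json.loads(raw_event)
    if event.get("type") == "conversation.item.input_audio_transcription.failed":
        failed_items.append(event["item_id"])

failed = []
note_failed_transcription(
    '{"type":"conversation.item.input_audio_transcription.failed",'
    '"item_id":"item_CMKd5yUAuaEkiFEWUp6sW","content_index":0,'
    '"error":{"type":"server_error","message":"Input transcription failed"}}',
    failed,
)
print(failed)  # the failed item is remembered for re-transcription
```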

@juberti I verified that the curl command you specified works just fine with the same API key.

Ah, okay: it works with gpt-4o-transcribe as the model instead of gpt-4o-transcribe-latest, so maybe the docs just need updating?

The model to use for transcription. Current options are whisper-1, gpt-4o-transcribe-latest, gpt-4o-mini-transcribe, and gpt-4o-transcribe.

Thanks for helping figure this out, we’ll make sure the docs get updated so others don’t hit this issue.


https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription/completed

The API reference above shows how to get the input audio transcription.
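Concretely, the caller’s text arrives in the transcript field of that conversation.item.input_audio_transcription.completed server event. A minimal sketch of pulling it out, using a made-up sample event (field names follow the linked reference):

```python
import json

# Sketch: extract the caller's text from a
# conversation.item.input_audio_transcription.completed server event.
def extract_input_transcript(raw_event: str):
    event = json.loads(raw_event)
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript")
    return None  # not an input-transcription event

# Made-up sample event for illustration.
sample = json.dumps({
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "item_abc123",
    "content_index": 0,
    "transcript": "Hi, I'd like to reschedule my appointment.",
})
print(extract_input_transcript(sample))
```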
