Implementing gpt-realtime and gpt-4o-transcribe for streaming transcription

First, I followed the OpenAI docs and successfully implemented the gpt-realtime conversation (using WebRTC).

Next, I am trying to implement transcription with the Realtime API.

What I want is: 1. talk into the mic, 2. the text appears in the UI as-is, 3. gpt-4o-transcribe post-processes the text according to its prompt.

Following the docs, I keep getting errors back from the API, and it seems the docs and the actual behavior are out of sync. Quite possibly I could be doing something wrong as well.

Currently, I get a first transcript, but it is not what I said and it comes back in Korean or Japanese. After this, transcription stops.

sendSessionUpdate()

private sendSessionUpdate(): void {
  if (!this.websocket || this.websocket.readyState !== WebSocket.OPEN) {
    return;
  }

  const configMessage = {
    type: 'transcription_session.update', // ❌ API rejects this
    input_audio_format: 'pcm16',
    input_audio_transcription: {
      model: 'gpt-4o-transcribe',
      prompt: '',
      language: 'en'
    },
    turn_detection: {
      type: 'server_vad',
      threshold: this.config.turn_threshold,
      prefix_padding_ms: this.config.turn_prefix_padding_ms,
      silence_duration_ms: this.config.turn_silence_duration_ms
    },
    input_audio_noise_reduction: {
      type: this.config.noise_reduction
    },
    include: ['item.input_audio_transcription.logprobs']
  };

  console.log('Sending configuration:', configMessage);
  this.websocket.send(JSON.stringify(configMessage));
}

WebSocket connection:

const subprotocols = [
  'realtime',
  `openai-insecure-api-key.${clientSecret}`
];
this.websocket = new WebSocket('wss://api.openai.com/v1/realtime?intent=transcription', subprotocols);

Server-side token generation:

const sessionConfig = {
  session: {
    type: "realtime",
    model: "gpt-realtime", // ✅ This works
    instructions: "You are a transcription assistant. Only transcribe the user's speech accurately. Do not generate any responses or additional text. Only output the exact words spoken.",
    output_modalities: ["text"],
    audio: {
      input: {
        transcription: {
          model: "gpt-4o-transcribe" // ✅ This works
        }
      }
    }
  }
};

Is transcription_session.update actually supported? The API says no, but docs say yes.

How do we configure transcription-only mode? We want continuous transcription, not conversation.

What’s the correct message format? Should we use session.update with different parameters?

Is there a different endpoint or approach? Maybe we need a different URL or authentication method?

Why does it switch to response generation? After one transcription, it stops listening and starts generating responses.

Any help would be greatly appreciated!


I have exactly the same problem. I am trying to get the transcription model working, but I get the following error from the WebSocket:

"Model \"gpt-4o-transcribe\" is not supported in realtime mode. See https://platform.openai.com/docs/models for a list of supported models."

My setup:

  1. Calling /v1/realtime/client_secrets from the server to get the token (works)
  2. Starting the WebSocket connection from the client to the endpoint wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe&intent=transcription
  3. Setting the WebSocket subprotocols for the token: ['realtime', 'openai-insecure-api-key.' + ephemeral as string]

The connection opens and immediately closes with that error.

Can anyone please help?

Yes! The connection opens and closes on my end too. I have written to support; hopefully I'll get a response and I'll let you know.

@Mansoor_Iqbal If you are trying to set up realtime transcription mode, then I think the doc here may not have been updated to the latest version.

From the API doc, it looks like the event type for realtime transcription is now also `session.update`, and the payload's `session.type` should be `transcription`.

Also, `input_audio_format`, `input_audio_transcription`, `turn_detection`, and `input_audio_noise_reduction` now live under the `audio.input` property.
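
Putting that together, here is a rough sketch of what the reshaped update could look like. I'm carrying the field names and values over from your original payload, so treat them (and values like `near_field`) as assumptions and double-check against the current API reference:

// Hedged sketch of a session.update for a transcription-only session.
// Field names and values under audio.input are assumptions based on the
// discussion above, not a verified schema.
function sendTranscriptionSessionUpdate(ws: WebSocket): void {
  const update = {
    type: 'session.update',        // not transcription_session.update
    session: {
      type: 'transcription',       // transcription-only mode, no responses
      audio: {
        input: {
          format: 'pcm16',
          transcription: {
            model: 'gpt-4o-transcribe',
            language: 'en'
          },
          turn_detection: {
            type: 'server_vad',
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 500
          },
          noise_reduction: { type: 'near_field' }
        }
      }
    }
  };
  ws.send(JSON.stringify(update));
}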

@flesicek For the WebSocket connection URL in realtime transcription-only mode, try this: `wss://api.openai.com/v1/realtime?intent=transcription`, since the transcribe model for realtime transcription mode is specified when doing `session.update`.


Thank you, there were 2 issues:

  1. the URL
  2. when generating client secrets, the session type needed to be set to transcription
{
  "expires_after": {
    "anchor": "created_at",
    "seconds": 600
  },
  "session": {
    "type": "transcription"
  }
}
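
For anyone else hitting this, here is a rough sketch of the server-side call from my earlier steps. The endpoint path is the one I mentioned above; the response field holding the secret is an assumption, so verify it against the API reference:

// Rough sketch: mint an ephemeral client secret for a transcription session.
// The response field name ('value') is an assumption; verify with the API docs.
async function createTranscriptionClientSecret(apiKey: string): Promise<string> {
  const res = await fetch('https://api.openai.com/v1/realtime/client_secrets', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      expires_after: { anchor: 'created_at', seconds: 600 },
      session: { type: 'transcription' }
    })
  });
  if (!res.ok) {
    throw new Error(`client_secrets request failed: ${res.status}`);
  }
  const data = await res.json();
  return data.value; // ephemeral secret used in the client-side subprotocol
}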
         

Now it works on my end, but I have one more issue. During realtime transcription with gpt-4o-transcribe, all the delta changes are received at once at the end of the turn.

Is there a way to configure the transcription to receive delta data continuously?

@flesicek I think the closest way is to listen to the server event `conversation.item.input_audio_transcription.delta`.
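
For completeness, a small sketch of what listening for those events could look like on the client. The event type strings are from the docs; everything else here is placeholder code:

// Sketch: surface transcription text as delta events arrive over the WebSocket.
function handleRealtimeMessage(event: MessageEvent): void {
  const msg = JSON.parse(event.data as string);

  switch (msg.type) {
    case 'conversation.item.input_audio_transcription.delta':
      // Partial transcript text for the current turn.
      console.log('delta:', msg.delta);
      break;
    case 'conversation.item.input_audio_transcription.completed':
      // Final transcript for the turn.
      console.log('completed:', msg.transcript);
      break;
    default:
      break;
  }
}

// websocket.addEventListener('message', handleRealtimeMessage);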

Yes, I'm listening to them, but they always arrive after the "speech stopped" event. And that is the issue.


Managed to get a WebSocket solution working as well, but I may have hit the same issue you are seeing, @flesicek. Will report back if I get it working.

I followed most of @ianyuhsunlin's suggestions, and I managed to get it working.

To add some more details: I am using Azure OpenAI with the Realtime API preview version (not the GA version, since only the preview version currently seems to be compatible with Azure OpenAI Realtime), a WebSocket connection, and I also used `session.update`. In general I just followed the documentation already mentioned (though it is likely only relevant for the `preview` version). The API version I used did not seem to have the `audio.input` property; it was likely added in a newer API release. If anyone is using Azure OpenAI, I hope this helps.

Not sure why you get all the deltas at the end, @flesicek. Maybe something is blocking them from being sent from the backend. I just made async calls to send the deltas through the WebSocket as soon as they arrived. Hopefully it is just something small that has gone wrong on your end.
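
In case it helps, this is roughly what I mean by forwarding each delta as soon as it arrives. The upstream and client socket names here are placeholders for my own relay setup:

// Sketch: relay transcription deltas from the upstream Realtime socket to the
// browser client immediately, instead of buffering them until the turn ends.
function relayDeltas(upstream: WebSocket, client: WebSocket): void {
  upstream.addEventListener('message', (event: MessageEvent) => {
    const msg = JSON.parse(event.data as string);
    if (msg.type === 'conversation.item.input_audio_transcription.delta') {
      // Forward right away; do not wait for the turn to complete.
      client.send(JSON.stringify({ text: msg.delta, final: false }));
    }
  });
}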


EDIT: I also used `server_vad` for voice activity detection (VAD) instead of the default `none`. Not sure if that's why it did not work for you; just mentioning it here.

Side question: do you see WebRTC meaningfully improve your latency KPIs? The docs tout WebRTC as faster due to lighter protocol overhead, but for my transcription use cases I consistently see WebSocket-based connections being faster than WebRTC. I'm curious whether your experience is different.