Transcription config for `gpt-4o-mini-transcribe` doesn't work?

Hi,

I’m trying to use the new STT model. I’m sending the following frame to the “wss://api.openai.com/v1/realtime?intent=transcription” WebSocket endpoint:

{
  "type": "transcription_session.update",
  "include": [
    "item.input_audio_transcription.logprobs"
  ],
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "prompt": "",
    "language": "",
    "model": "gpt-4o-mini-transcribe"
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500,
    "create_response": true
  },
  "input_audio_noise_reduction": {
    "type": "near_field"
  }
}
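
For reference, the frame goes out over a plain WebSocket connection, roughly like this (a minimal sketch using the websocket-client package rather than my actual client; the OpenAI-Beta: realtime=v1 header is an assumption carried over from the regular Realtime API):

# Sketch only: open the transcription-intent socket and send the update frame above.
import json
import os

from websocket import create_connection  # pip install websocket-client

ws = create_connection(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta: realtime=v1",  # assumption: same beta header as the regular Realtime API
    ],
)

ws.send(json.dumps({
    "type": "transcription_session.update",
    "include": ["item.input_audio_transcription.logprobs"],
    "input_audio_format": "pcm16",
    "input_audio_transcription": {"prompt": "", "language": "", "model": "gpt-4o-mini-transcribe"},
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500,
        "create_response": True,
    },
    "input_audio_noise_reduction": {"type": "near_field"},
}))
print(ws.recv())  # first event back from the server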

but I get the response

{
  "type": "transcription_session.created",
  "event_id": "event_BDIjP2WyEWBCFq31A1QrP",
  "session": {
    "id": "sess_BDIjPc4zRbhQgBDcJSDXL",
    "object": "realtime.transcription_session",
    "expires_at": 1742511767,
    "input_audio_noise_reduction": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 200
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": null, # <--- this became null
    "client_secret": null,
    "include": null
  }
}

The transcription also doesn’t work. I do get speech_started and speech_stopped events:

"{\"type\":\"input_audio_buffer.speech_started\",\"event_id\":\"event_BDIjZmZPzKtGoZb3BG1g2\",\"audio_start_ms\":9780,\"item_id\":\"item_BDIjZcVYwyGgsUmbpHA83\"}"

but instead of the transcript events described in the documentation, I seem to receive conversation.item events:

"{\"type\":\"conversation.item.created\",\"event_id\":\"event_BDIja3273Hm7Ufs4GC5rL\",\"previous_item_id\":null,\"item\":{\"id\":\"item_BDIjZcVYwyGgsUmbpHA83\",\"object\":\"realtime.item\",\"type\":\"message\",\"status\":\"completed\",\"role\":\"user\",\"content\":[{\"type\":\"input_audio\",\"transcript\":null}]}}"

From the responses, I suspect that it’s still treating the connection as the non-transcription Realtime API. Not sure what I’m doing wrong though.

Any ideas would be appreciated.


I can throw random ideas your way.

I don’t know if the URL query string is correct or needed. I’m guessing you found that somewhere, but the API reference doesn’t have it.

logprobs is only an include on transcriptions, not realtime.

List models via the API. Try the dated model version instead of the alias.
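
For example, with the openai Python package (a quick sketch; the substring filter is just to narrow the output):

# Sketch: list the model IDs your key can see, to find the dated transcribe versions.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
for model in client.models.list():
    if "transcribe" in model.id or "whisper" in model.id:
        print(model.id)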

The model whisper-1 is also supported. You can check whether it's the new model itself that's causing the necessary input_audio_transcription details to be dropped.

If your application isn't continuous realtime audio (monitoring for when someone speaks with VAD and triggering transcription response events), the transcriptions endpoint also has the new models available.
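
Roughly like this, if a one-shot request fits your use case (a sketch with the official openai Python package; the file name is just a placeholder):

# Sketch: transcribe a finished recording over the REST transcriptions endpoint
# instead of the realtime socket.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("recording.wav", "rb") as audio_file:  # placeholder file name
    result = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
    )
print(result.text)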

That’s just stuff thrown against the wall so far.

Thanks for the ideas, but they didn’t work.

The URL param is listed here: https://platform.openai.com/docs/guides/speech-to-text#streaming-the-transcription-of-an-ongoing-audio-recording

Tried with whisper-1 and without logprobs and that didn’t work either.

Looks like the documentation is not up to date.
Right now you actually have to wrap the config in a "session" field for it to work, otherwise it throws an error:

{
  "type": "transcription_session.update",
  "session": {
    "input_audio_format": "pcm16",
    "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500,
    },
    "input_audio_noise_reduction": {
      "type": "near_field"
    }
  }
}
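
End to end it looks something like this (again just a sketch with the websocket-client package; the check at the end mirrors the field that was coming back null in the original post):

# Sketch: connect, send the update wrapped in "session", and check the echoed config.
import json
import os

from websocket import create_connection  # pip install websocket-client

ws = create_connection(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta: realtime=v1",
    ],
)
print(ws.recv())  # transcription_session.created

ws.send(json.dumps({
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        },
        "input_audio_noise_reduction": {"type": "near_field"},
    },
}))

# The next session event should now echo the transcription config instead of null.
event = json.loads(ws.recv())
print(event.get("type"), event.get("session", {}).get("input_audio_transcription"))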

Thanks! I realized the same thing and have pinged someone from the OpenAI team on Twitter about this.