Introducing GPT-4o Transcribe Diarize: Now Available in the Audio API

GPT-4o Transcribe Diarize is a transcription model that identifies who’s speaking when, so transcripts can clearly associate audio segments with individual speakers. The model returns output in the new diarized_json response format, which provides precise speaker labels along with segment start and end timestamps.
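To make the response format more concrete, here is a minimal sketch of how a diarized_json result could be post-processed into readable speaker turns. The payload below is a hypothetical sample, and the field names ("segments", "speaker", "start", "end", "text") are assumptions based on the description above; check the API reference for the exact schema.

```python
from typing import Dict, List

# Hypothetical sample shaped like a diarized_json response. The field names
# are assumptions, not a confirmed schema.
SAMPLE_RESPONSE = {
    "segments": [
        {"speaker": "A", "start": 0.0, "end": 2.1, "text": "Hi, thanks for calling."},
        {"speaker": "A", "start": 2.1, "end": 4.0, "text": "How can I help?"},
        {"speaker": "B", "start": 4.3, "end": 6.8, "text": "My order never arrived."},
    ]
}


def merge_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            # Speaker changed (or first segment): start a new turn.
            turns.append(
                {
                    "speaker": seg["speaker"],
                    "start": seg["start"],
                    "end": seg["end"],
                    "text": text,
                }
            )
    return turns


def format_turns(turns: List[Dict]) -> List[str]:
    """Render each turn as '[start-end] speaker: text'."""
    return [
        f"[{t['start']:.1f}-{t['end']:.1f}] {t['speaker']}: {t['text']}"
        for t in turns
    ]


if __name__ == "__main__":
    for line in format_turns(merge_turns(SAMPLE_RESPONSE["segments"])):
        print(line)
```

Merging adjacent segments is a presentation choice for this sketch only; keep the raw segments if you need the original per-segment timestamps.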

What’s included:

  • Automatic Speaker Identification: GPT-4o Transcribe Diarize automatically detects and labels different speakers, simplifying multi-speaker audio transcription.
  • Speaker Reference Clips: Optionally enhance accuracy by providing short (2–10 second) reference audio clips for up to four known speakers.
  • API Endpoint: Available through /v1/audio/transcriptions in the Transcription API.
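As a rough illustration of the reference-clip limits listed above, the sketch below validates a set of clips (at most four speakers, each clip 2–10 seconds) and packages them as base64 data URLs. The helper names are hypothetical, and the request field names known_speaker_names and known_speaker_references, as well as the data-URL encoding, are assumptions here; confirm them against the API reference before sending anything to /v1/audio/transcriptions.

```python
import base64
from typing import Dict, List, Tuple

# Limits taken from the announcement: up to four known speakers,
# each with a 2-10 second reference clip.
MAX_KNOWN_SPEAKERS = 4
MIN_CLIP_SECONDS = 2.0
MAX_CLIP_SECONDS = 10.0


def to_data_url(audio_bytes: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_known_speaker_fields(
    clips: List[Tuple[str, bytes, float]],
) -> Dict[str, List[str]]:
    """Validate (name, audio_bytes, duration_seconds) clips and build request fields.

    The returned keys are assumed field names, not a confirmed schema.
    """
    if len(clips) > MAX_KNOWN_SPEAKERS:
        raise ValueError(f"at most {MAX_KNOWN_SPEAKERS} known speakers are supported")

    names: List[str] = []
    references: List[str] = []
    for name, audio_bytes, duration in clips:
        if not MIN_CLIP_SECONDS <= duration <= MAX_CLIP_SECONDS:
            raise ValueError(
                f"clip for {name!r} is {duration}s; expected "
                f"{MIN_CLIP_SECONDS}-{MAX_CLIP_SECONDS}s"
            )
        names.append(name)
        references.append(to_data_url(audio_bytes))

    return {
        "known_speaker_names": names,
        "known_speaker_references": references,
    }
```

The caller supplies each clip's duration in this sketch; in practice you would measure it from the audio file itself before building the request.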

Speaker diarization has been frequently requested by our developer community; this feature represents a meaningful improvement to existing transcription tools.

Check out the documentation and the API reference to get started and explore detailed examples.

Looking forward to seeing how you utilize this feature!

Saw the model earlier in code pushed yesterday - it’s not been put on the models endpoint yet.

Here is the text from the official announcement email:

Thanks to @multitechvisions for sharing!

Awesome… We’ve been waiting for diarization from OpenAI.

But one missing feature is preventing us from adopting this model: the ability to set a maximum/minimum number of speakers. We love the known-speakers capability, but capping the speaker count is crucial for us, because diarization models often hallucinate more speakers than there actually are. Preventing that is important for the accuracy of our product.

I did some testing, but the diarization model seems much worse than the old 4o transcribe model.
The new model skips over words that the old model caught. Has anyone else seen this?

I’ve had the same experience: the new model completely misses some sentences, which significantly impacts production.

The quantitative evaluation results are also worse.

Please consider restoring the old model on the gpt-4o-transcribe endpoint.

Yes, it does hallucinate very badly.