GPT-4o Transcribe Diarize is a transcription model that identifies who is speaking when, producing transcripts that clearly associate audio segments with individual speakers. It supports the new diarized_json response format, which returns precise speaker labels along with start and end timestamps for each segment.
What’s included:
Automatic Speaker Identification: GPT-4o Transcribe Diarize automatically detects and labels different speakers, simplifying multi-speaker audio transcription.
Speaker Reference Clips: Optionally enhance accuracy by providing short (2–10 second) reference audio clips for up to four known speakers.
API Endpoint: Available through /v1/audio/transcriptions in the Transcription API (see the usage sketch below).
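Here's a minimal sketch of what a call might look like with the official Python SDK. The model id and the segment field names in the loop are assumptions inferred from this announcement rather than confirmed identifiers, so check the API reference before relying on them:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-4o-transcribe-diarize" is an assumed model id based on the
# announced name; verify it against the models list.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",  # the new format from this announcement
        # Reference clips for known speakers can optionally be supplied;
        # the exact parameter name isn't given here, so it's omitted.
    )

# Field names (segments / speaker / start / end / text) are assumptions
# inferred from "speaker labels along with segment start and end timestamps".
for segment in transcript.segments:
    print(f"[{segment.speaker}] {segment.start:.1f}s–{segment.end:.1f}s: {segment.text}")
```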
Speaker diarization has been a frequent request from our developer community; this feature is a meaningful improvement over existing transcription tools.
Awesome… We’ve been waiting for diarization from OpenAI.
But one missing feature is preventing us from adopting this model: the ability to set a minimum/maximum number of speakers. We love the known-speakers capability, but capping the speaker count is crucial for us, as diarization models often hallucinate more speakers than are actually present. Ensuring that doesn't happen is important for the accuracy of our product.
I did some testing, but the diarization model seems way worse than the old 4o transcribe model.
The new model skips over words that the old one didn't. Anyone else seeing this?