They just released a new model: gpt-4o-transcribe-diarize. Initial results look very promising.
Yeah, a lot of people use Whisper for transcription + pyannote for the diarization, though stitching the two together isn’t 100% accurate. Might be worth looking into an API that combines transcription and diarization in one pass for better results. Some APIs claim very high diarization accuracy; I’d evaluate a few of those on your own audio.
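For anyone curious what the Whisper + pyannote stitching looks like, here is a minimal sketch of the usual alignment step: each transcript segment gets the speaker whose diarization turn overlaps it the most. The tuple shapes below are assumptions for illustration, not either library’s actual API (in practice you’d pull `start`/`end` fields off Whisper segments and pyannote’s annotation tracks).

```python
def assign_speakers(segments, turns):
    """Label transcript segments with speakers by maximum temporal overlap.

    segments: list of (start, end, text) from the transcriber.
    turns:    list of (start, end, speaker) from the diarizer.
    Returns a list of (text, speaker) pairs.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap of the two intervals; negative means no overlap.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, speaker
        labeled.append((text, best_speaker))
    return labeled


# Tiny synthetic example: two segments, two speaker turns.
segments = [(0.0, 2.0, "hi there"), (2.0, 5.0, "hello back")]
turns = [(0.0, 1.8, "SPEAKER_A"), (1.8, 5.0, "SPEAKER_B")]
print(assign_speakers(segments, turns))
# → [('hi there', 'SPEAKER_A'), ('hello back', 'SPEAKER_B')]
```

The inaccuracy people complain about usually comes from exactly this step: when a diarization turn boundary lands mid-segment, the whole segment gets one speaker, which is why a model that does both jointly is appealing.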
Hi all,
I heard there is a model that should be available soon: GPT-4o Transcribe Diarize.
Well, it looks like it has been released, but in the SDK we are using (.NET) it only became available a couple of days ago.
We are testing this right now as we speak.
Currently, when clients select OpenAI for transcription, we diarize via the prompt itself: we instruct the model in the system prompt to detect the speakers and label them. It works reasonably well, and people use it.
This is why we want to get GPT-4o Transcribe Diarize working, so we can remove this workaround from the system prompt.
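For context on what that prompt-based workaround looks like, here is a hedged sketch of building such a request: after getting a plain transcript, a chat model is asked to guess and label speaker turns. The model name and prompt wording are illustrative assumptions, not the poster’s actual configuration.

```python
def build_diarization_request(transcript: str, model: str = "gpt-4o") -> dict:
    """Build a chat-completions-style payload that asks a model to
    infer speaker turns from an unlabeled transcript."""
    system_prompt = (
        "You will receive a raw transcript with no speaker labels. "
        "Detect the distinct speakers and rewrite the transcript as "
        "'Speaker 1: ...', 'Speaker 2: ...' turns, preserving the wording."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
    }


payload = build_diarization_request("hello how are you / fine thanks")
print(payload["model"], len(payload["messages"]))
```

The weakness of this approach is that the model only sees text, so two speakers with similar phrasing are easy to confuse, which is what a dedicated diarization model should fix.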
Really appreciate you sharing this update here, @nbo2, this is exactly the kind of signal people were waiting for.
Totally get why diarization has been a bit of a pain so far. Prompt-based speaker labeling works, but yeah… it’s more of a workaround than a real solution.
What you called out lines up with what others have been seeing:
- GPT-4o Transcribe Diarize is now starting to roll out
- SDK support (like .NET) is only just catching up
- And teams are actively testing it to replace the prompt-based approach
That shift matters. Moving from “ask the model to guess speakers” to actual diarization support should make things way more reliable and remove a lot of prompt complexity.
For folks following this thread: if you’ve been stitching together diarization via prompts, this is probably the moment to start testing the new model and see how it compares in your setup.
@nbo2 if you notice any differences in accuracy or edge cases as you test, would be great to hear, especially vs the prompt-based method.
Mark G.