Best solution for Whisper diarization/speaker labeling?

Wondering what the state of the art is for diarization using Whisper, or if OpenAI has revealed any plans for native implementations in the pipeline. I’ve found some that can run locally, but ideally I’d still be able to use the API for speed and convenience.

Google Cloud Speech-to-Text has built-in diarization, but I’d rather keep my tech stack all OpenAI if I can, and believe Whisper is better regardless.


Yes, this is a shortcoming. I have tried using Whisper in combination with pyannote, but the combination is a bit complicated to implement, and the results are not even close to ideal.
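For anyone attempting the Whisper + pyannote route, the fiddly part is aligning the two outputs: Whisper gives you timestamped text segments, pyannote gives you timestamped speaker turns, and you have to match them up. Here is a minimal sketch of that alignment step in pure Python; the function name and data shapes are my own illustration, not either library's API (assume you've already extracted Whisper segments as dicts and pyannote turns as start/end/speaker tuples):

```python
def assign_speakers(whisper_segments, speaker_turns):
    """Label each Whisper segment with the speaker whose diarization
    turn overlaps it the most. Segments with no overlap get 'unknown'."""
    labeled = []
    for seg in whisper_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for start, end, speaker in speaker_turns:
            # length of the intersection of [seg start, seg end] and [start, end]
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Example with hand-made data (timestamps in seconds):
segments = [
    {"start": 0.0, "end": 4.0, "text": "Hi, thanks for joining."},
    {"start": 4.2, "end": 7.5, "text": "Happy to be here."},
]
turns = [(0.0, 4.1, "SPEAKER_00"), (4.1, 8.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

Note this simple max-overlap rule is exactly where the "not close to ideal" results come from: when two people talk over each other, a segment gets only one label.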


This library has picked up a bunch of steam. I haven't used it yet, but everything I've read and seen looks pretty amazing. It runs locally without the API, yet seems to be especially fast.

This still cannot differentiate between speakers.


Seems like it does.

Hi there, so finally, did you find the best solution for diarization?

I am looking for a fast inference model that can diarize faster than Pyannote.

well… not so ‘short’ a coming, as we can see (judging by the date it was posted and where we are today) :sweat_smile:


So, all diarization can be sketchy; it's one of those areas where the shiny marketing material definitely does not match reality.

However, what I have noticed gives unusually good results is gpt-4o, when instructed to make a best guess as to different speakers based on conversation style. People use very different words and phrases, and gpt-4o has given some really good results with a prompt like this:

"there are three speakers in this transcription, make a best guess, based on speaking styles, at diarising it: … "

It works even better if you give the names of the people and indicate what one of their views in the conversation is (giving the model something to get a foothold on).


Do you have access to gpt-4o with audio input? How about audio output? Can you tell us more about your experience using it? We mere mortals can only use text and image input and text output for now. I'm interested in how audio input and audio output work.

I'm doing the same thing as you, nothing above mortal level :slight_smile:

I'm using Whisper to create a general, non-diarised transcript, then I'm feeding that transcript into gpt-4o with the prompt and hints, and it does really well at guessing the different people.

I then present that to the user in a UI for final checks and changes
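For reference, that two-step pipeline can be sketched like this. The prompt builder is my own illustration of the "names plus a viewpoint hint" idea; the exact wording, file name, and helper name are assumptions, not a tested recipe:

```python
def build_diarisation_prompt(transcript, speakers, viewpoint_hint=""):
    """Compose the kind of hint-rich prompt described above: speaker count,
    names, and optionally one speaker's viewpoint as a foothold."""
    prompt = (
        f"There are {len(speakers)} speakers in this transcription: "
        f"{', '.join(speakers)}. Make a best guess, based on speaking "
        "styles, at diarising it. Prefix each turn with the speaker's name."
    )
    if viewpoint_hint:
        prompt += f" For context: {viewpoint_hint}."
    return prompt + "\n\nTranscript:\n" + transcript

# Usage with the OpenAI Python SDK (assumes `pip install openai` and an
# OPENAI_API_KEY in the environment):
#
#   from openai import OpenAI
#   client = OpenAI()
#   with open("meeting.mp3", "rb") as f:
#       text = client.audio.transcriptions.create(model="whisper-1", file=f).text
#   prompt = build_diarisation_prompt(
#       text, ["Alice", "Bob"], "Alice is arguing for the proposal")
#   reply = client.chat.completions.create(
#       model="gpt-4o", messages=[{"role": "user", "content": prompt}])
#   print(reply.choices[0].message.content)
```

The model's guess then goes to the user for final checks, as described above.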


Could you explain in more detail how to ask GPT-4o to transcribe audio and identify speakers? I have successfully transcribed my audio into text, but when I pass it to GPT, it attempts to use Python code to complete the task, which is unsuccessful.
Also, I've been using an Assistant with GPT-4o and JSON output, but it won't split the full transcription into multiple speakers.

Interesting approach, but I am not sure it can be more precise than Pyannote. Have you tried both and compared?

Use the Microsoft Speech API and pipe that response into the API.

This is the way to do it, for anyone wondering. It's the legitimate way to transcribe long-form audio efficiently.

Keep in mind the Batch API is 18 cents an hour, and real-time is $1 an hour, so think about the costs to your app or business.

```python
import azure.cognitiveservices.speech as speechsdk

# speech_config carries your Azure Speech key and region
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")
conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config)
```

It does a really good job at diarizing, but the recognition isn't as good as with Whisper.
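For the "pipe that response into the API" step mentioned above, a small helper like this can collapse the per-utterance results (if I recall correctly, the transcriber's `transcribed` callback exposes a speaker id and text on each result; the pair format and function name here are my own assumption) into speaker-labelled turns you can hand to a chat model:

```python
def merge_turns(utterances):
    """Collapse consecutive utterances from the same speaker into one turn,
    producing 'Speaker: text' lines ready to paste into a chat prompt."""
    turns = []
    for speaker, text in utterances:
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + text  # same speaker keeps talking
        else:
            turns.append([speaker, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)

# Example with the kind of (speaker_id, text) pairs you'd collect
# from the transcription callbacks:
utterances = [
    ("Guest-1", "Hello everyone."),
    ("Guest-1", "Shall we start?"),
    ("Guest-2", "Yes, let's begin."),
]
print(merge_turns(utterances))
```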

Here is a way to do it:

Please note, I am on the Falcon development team.

Interesting… If I understand correctly, Falcon appears to rely on Whisper's segmentation and tries to assign the most likely speaker to each segment. If that is the case, doesn't it depend on how accurate the segmentation is in the first place, since most segmentation is based on silence gaps between utterances? What happens to partially overlapped speech?