Best solution for Whisper diarization/speaker labeling?

ianwatts · November 16, 2023, 12:28am

Wondering what the state of the art is for diarization using Whisper, or if OpenAI has revealed any plans for native implementations in the pipeline. I’ve found some that can run locally, but ideally I’d still be able to use the API for speed and convenience.

Google Cloud Speech-to-Text has built-in diarization, but I’d rather keep my tech stack all OpenAI if I can, and believe Whisper is better regardless.

nikola1jankovic · November 22, 2023, 9:02am

Yes, this is a shortcoming. I have tried using Whisper in combination with Pyannote, but the result is a bit complicated to implement, plus the results are not even close to ideal.

ianwatts · December 6, 2023, 5:55am

This library picked up a bunch of steam, haven’t used yet but everything I’ve read and seen looks pretty amazing. Runs locally, doesn’t use API, but seems to be especially fast.

poldus · December 12, 2023, 11:20am

This still cannot differentiate between speakers.

dy776czjz · February 2, 2024, 8:53pm

Vaibhavs10/insanely-fast-whisper/blob/main/src/insanely_fast_whisper/utils/diarization_pipeline.py

Seems like it does

bcastaing02 · March 8, 2024, 12:18pm

Hi there, so finaly, did you find the best solution for diarization ?

prafull.soni · April 4, 2024, 5:39am

I am looking for fast inference model which can diarize faster then PyAnnote

vasyl · April 11, 2024, 6:59pm

well… not so 'short’coming as we can see (by the date it was posted and where we are today)

aPeaceOfAdam · May 17, 2024, 4:16am

So all diarization can be sketchy, it’s one of those areas where the shiny marketing material definitely does not match reality.

However, what I have noticed gives really unusually good results is gpt-4o, when instructed to make a best guess as to different speakers based on conversation style. People use very different words and phrases and gpt-4o has given some really good results with a prompt like this:

"there are three speakers in this transcription, make a best guess, based on speaking styles, at diarising it: … "

It works even better if you give the names of the people and indicate what one of their views in the conversation is (giving it something to get a foothold on)

supershaneski · May 17, 2024, 5:27am

do you have access to gpt-4o with audio input? how about audio output? can you tell us more about your experience using it? us mere mortals can only use text and image input and text output for now. i’m interested how audio input and audio output works.

aPeaceOfAdam · May 18, 2024, 6:58am

I’m doing the same thing as you, nothing above moral level

I’m using whisper to create a general, non diarised transcript then I’m feeding that transcript into gpt-4o with the prompt and hints and it does really well at guessing at different people.

I then present that to the user in a UI for final checks and changes

Dayrion · May 20, 2024, 3:44pm

Could you explain in more detail how to ask GPT-4o to transcribe audio and identify speakers? I have successfully transcribed my audio into text, but when I pass it to GPT, it attempts to use Python code to complete the task, which is unsuccessful.
Also, I’ve been using an Assistant with GPT-4o, JSON output but won’t do the full transcription into multiple speakers.

nikola1jankovic · May 22, 2024, 2:00pm

Interesting approach, but I am not sure it can be more precise than PyAnnote. Have you tried both and compared?

jo.le · May 28, 2024, 9:23am

import azure.cognitiveservices.speech as speechsdk
conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config)

does a really good job at diarizing, but the recognition isn’t as good as with whisper

Pouya91 · May 28, 2024, 3:11pm

Here is a way to do it:

picovoice.ai/blog/falcon-whisper-integration/

Please note, I am in the Falcon development team.

phj.seddon · June 25, 2024, 3:25pm

Interesting… If I understand correctly Falcon appears to rely on the Whisper segmentation and trying to assign the most likely speaker to each segment. If this is the case does it not depend on how accurate is the segmentation is in the first place, because most segmentations are based on silence gaps between utterances. What will happen to the partially overlapped speech?

moba1720902 · October 2, 2024, 10:53am

It still uses pyannote pipeline. Just so people know. Don’t know if the performance is better in any way here.

chandanmb7 · October 3, 2024, 8:50am

Can you please explain me a bit more about your prompts which you are feeding in for gpt-4o ?

yassir.habek · December 6, 2024, 10:31am

I used the Azure Batch Speech to Text. for the Dutch language it uses Whisper with the diarization. the results are pretty good but note that Azure is pretty expensive and the use of it is pretty hard.

deepm · December 18, 2024, 1:38am

Although I have not tried it, I found this repo on gh: kadirnar/whisper-plus , which uses PyAnnote and Whisper-3, and it seems promising.

Topic		Replies	Views
Thoughts on Whisper-3 announcement API whisper	5	11236	November 7, 2023
Can Whisper distinguish two speakers? API whisper	9	36934	August 5, 2024
Whisper API: a) Timecodes; b) how good is open-source vs API? API whisper	9	6301	July 28, 2023
Speech to Text (ASR) Strategy Community whisper , audio , gpt-4o-audio-preview	8	266	March 10, 2025
Whisper API at Azure - more technically advanced, but the price? API whisper	1	4331	December 17, 2023

Best solution for Whisper diarization/speaker labeling?

Related topics