Best solution for Whisper diarization/speaker labeling?

Wondering what the state of the art is for diarization using Whisper, or if OpenAI has revealed any plans for native implementations in the pipeline. I’ve found some that can run locally, but ideally I’d still be able to use the API for speed and convenience.

Google Cloud Speech-to-Text has built-in diarization, but I’d rather keep my tech stack all OpenAI if I can, and believe Whisper is better regardless.


Yes, this is a shortcoming. I have tried using Whisper in combination with Pyannote, but the setup is a bit complicated to implement, and the results are not even close to ideal.
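For anyone curious what the Whisper + Pyannote combination looks like in practice, the core of it is aligning the two outputs: Whisper gives timestamped text segments, pyannote gives timestamped speaker turns, and you match each segment to the speaker turn it overlaps most. This is a minimal sketch of just that alignment step — the segment/turn data shown is hypothetical, and in a real pipeline it would come from `whisper.transcribe(...)` and a `pyannote.audio` diarization pipeline:

```python
# Sketch: assign a speaker label to each Whisper segment by picking the
# pyannote speaker turn with the greatest time overlap.
# Assumes you already have (shapes are illustrative):
#   whisper_segments: [{"start": s, "end": e, "text": ...}, ...]
#   speaker_turns:    [(start, end, "SPEAKER_00"), ...]

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(whisper_segments, speaker_turns):
    labeled = []
    for seg in whisper_segments:
        # Pick the speaker turn that overlaps this segment the most.
        best = max(
            speaker_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
            default=None,
        )
        has_overlap = best is not None and overlap(
            seg["start"], seg["end"], best[0], best[1]
        ) > 0
        speaker = best[2] if has_overlap else "UNKNOWN"
        labeled.append({**seg, "speaker": speaker})
    return labeled

# Hypothetical example data:
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello there."},
    {"start": 2.6, "end": 5.0, "text": "Hi, how are you?"},
]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 5.2, "SPEAKER_01")]

for seg in label_segments(segments, turns):
    print(f'{seg["speaker"]}: {seg["text"]}')
```

The complication in practice is that the two tools segment audio differently, so a Whisper segment can straddle a speaker change — which is part of why results from this approach end up far from ideal.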


This library has picked up a bunch of steam. I haven’t used it yet, but everything I’ve read and seen looks pretty amazing. It runs locally rather than through the API, but it seems to be especially fast.

This still cannot differentiate between speakers.


Seems like it does

Hi there, so finally, did you find the best solution for diarization?

I am looking for a fast inference model which can diarize faster than PyAnnote.

well… not so ‘short’coming as we can see (by the date it was posted and where we are today) :sweat_smile: