Best solution for Whisper diarization/speaker labeling?

Wondering what the state of the art is for diarization with Whisper, or whether OpenAI has revealed any plans for a native implementation. I’ve found some solutions that run locally, but ideally I’d still be able to use the API for speed and convenience.

Google Cloud Speech-to-Text has built-in diarization, but I’d rather keep my tech stack all OpenAI if I can, and I believe Whisper is the better transcriber regardless.


Yes, this is a shortcoming. I have tried using Whisper in combination with Pyannote, but it is a bit complicated to implement, and the results are not even close to ideal.
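
For anyone attempting the same combination, the glue step usually comes down to matching each transcript segment to the speaker turn it overlaps most in time. Below is a rough sketch of that idea only, not a tested pipeline. It assumes the transcript segments are dicts with `start`, `end`, and `text` keys (the shape of the segments in Whisper's verbose JSON output) and that the diarization output has already been flattened into `(start, end, speaker)` tuples. The function name `assign_speakers` and the sample data are made up for illustration.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of dicts with "start", "end", "text" (seconds), assumed shape.
    turns: list of (start, end, speaker) tuples from a diarizer, assumed shape.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time interval shared by the segment and the turn.
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Speaker stays None if no turn overlaps the segment at all.
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Made-up example data, not real model output.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hi, thanks for joining."},
    {"start": 2.5, "end": 5.0, "text": "Happy to be here."},
]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 5.2, "SPEAKER_01")]

for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

Matching at the segment level is the simplest option, and it is also one reason results can look poor: a single transcript segment that spans a speaker change gets one label. Word-level timestamps would allow finer matching, at the cost of more bookkeeping.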