Speech to text with diarization

Whisper doesn’t natively support speaker diarization. To get diarized transcripts, you’d need to use a diarization library like pyannote to segment the audio by speaker, then pass each speaker-labeled segment to Whisper for transcription.
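Here's a minimal sketch of that segment-then-transcribe flow. The `Turn` segments would come from a diarization model and `transcribe_segment` would slice the audio and run Whisper on it; both are stubbed here, since the real pyannote and Whisper calls require model downloads (and, for pyannote, a Hugging Face access token):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Turn:
    """One diarized speaker turn, as produced by a diarization model."""
    speaker: str
    start: float  # seconds
    end: float    # seconds

def diarized_transcript(
    turns: List[Turn],
    transcribe_segment: Callable[[float, float], str],
) -> List[Tuple[str, str]]:
    """Transcribe each diarized turn and label it with its speaker.

    In a real pipeline, `turns` would come from pyannote's speaker
    diarization and `transcribe_segment` would cut the audio between
    start/end and pass that clip to Whisper.
    """
    return [(t.speaker, transcribe_segment(t.start, t.end)) for t in turns]

# Stubbed example (hypothetical speaker labels and transcripts):
turns = [Turn("SPEAKER_00", 0.0, 2.5), Turn("SPEAKER_01", 2.5, 5.0)]
fake_asr = {(0.0, 2.5): "Hello there.", (2.5, 5.0): "Hi, how are you?"}
result = diarized_transcript(turns, lambda s, e: fake_asr[(s, e)])
```

The key design point is that diarization and transcription stay decoupled: any diarizer that emits (speaker, start, end) turns and any transcriber that accepts a time window can be swapped in.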

Unfortunately, this approach can still make mistakes: pyannote infers who said what from the audio itself, and its predictions aren’t always accurate, particularly with overlapping speech or similar-sounding voices. If you can, I’d look for an API or recording setup that captures a separate audio stream per speaker, which gives you exact speaker attribution and is a faster way of solving this problem.