Speech to text with diarization

Hello everyone,

I want to do speech to text with derealization using the Whisper API. So far I have managed to transcribe an audio file with two speakers to text, but without separating the speakers.
The goal is to split the transcript into agent and customer.
Thanks for your help.

Yup, it’s quite a thing where the model doesn’t understand reality the way we do and then…

but then I decided to fix the autocorrect typo in the title, at least, and I suggest you search the forum. There are some solutions for this problem and the open source community has also contributed a lot, especially for the V2 model.

Hope this helps.

Whisper doesn’t do speaker diarization natively; you will have to use a separate model specifically for this purpose. Generally speaking, you start by chunking the input based on who’s speaking and then send those chunks to Whisper for transcription.
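To make the chunk-then-transcribe idea concrete, here is a minimal sketch. It assumes you already have speaker turns from some diarization model; the `Turn` class, `merge_turns`, and the `transcribe` callback are hypothetical names for illustration, not part of Whisper. Merging adjacent turns from the same speaker first gives the transcriber longer chunks with more context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label produced by the diarization model


def merge_turns(turns: List[Turn], max_gap: float = 0.5) -> List[Turn]:
    """Join consecutive turns by the same speaker separated by a short pause."""
    merged: List[Turn] = []
    for t in sorted(turns, key=lambda t: t.start):
        last = merged[-1] if merged else None
        if last is not None and last.speaker == t.speaker and t.start - last.end <= max_gap:
            last.end = max(last.end, t.end)
        else:
            merged.append(Turn(t.start, t.end, t.speaker))  # copy, keep input intact
    return merged


def transcribe_by_speaker(
    turns: List[Turn],
    transcribe: Callable[[float, float], str],
) -> List[str]:
    """Call transcribe(start, end) once per merged chunk and label the result."""
    lines = []
    for t in merge_turns(turns):
        text = transcribe(t.start, t.end).strip()
        if text:
            lines.append(f"{t.speaker}: {text}")
    return lines
```

In practice `transcribe` would cut the audio between `start` and `end` and send that clip to Whisper; here it is left as a callback so the chunking logic can be tried on its own.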

Any news on this topic? It has been more than a year. People have shared references to the WhisperX project, but it has quite a lot of dependencies.

@dvarshney86 What’s the use case? There are a few different options for speaker diarization.

If you’re trying to get a diarized transcript, first you’ll need to decide on the level/guarantee you need. You can do machine diarization on a single audio file, but that gets tricky when people talk over one another, because the audio will often not capture what everyone said in that moment. If you’re able to get speaker-separated streams (separate audio for each speaker), then diarization becomes a whole lot easier and the quality is better.
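As a rough sketch of why separate streams help: if each speaker's audio is transcribed on its own and the transcriber returns timestamped segments, building the diarized transcript is just a merge by start time. The function name and the `(start, end, text)` tuple layout below are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def interleave_streams(streams: Dict[str, List[Segment]]) -> List[Tuple[float, str, str]]:
    """Merge per-speaker segments into one timeline ordered by start time.

    `streams` maps a role such as "agent" or "customer" to the segments
    transcribed from that speaker's own audio channel.
    """
    timeline = [
        (start, role, text)
        for role, segments in streams.items()
        for start, _end, text in segments
    ]
    # Sort on start time only, so ties keep their original stream order.
    timeline.sort(key=lambda item: item[0])
    return timeline


if __name__ == "__main__":
    demo = {
        "agent": [(0.0, 2.0, "Hello, how can I help?"), (5.0, 6.0, "Sure.")],
        "customer": [(2.5, 4.5, "I have a billing question.")],
    }
    for start, role, text in interleave_streams(demo):
        print(f"[{start:5.1f}s] {role}: {text}")
```

Because the speaker is known from the channel, overlapping speech still ends up attributed to the right person; only the ordering of overlapping lines is a judgment call.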

Happy to chat about it more if you still need to figure out how to get diarized transcripts and have questions.

Whisper doesn’t natively support speaker diarization. If you wanted to get diarized transcripts, you’d have to use a diarization library like pyannote to segment the audio by speaker, then pass each segment to Whisper for transcription.
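A minimal sketch of that pipeline, under several assumptions: pyannote.audio 3.x (where the pipeline returns an annotation you can walk with `itertracks`), the open-source `openai-whisper` package rather than the hosted API, ffmpeg installed for `whisper.load_audio`, and a Hugging Face token with access to the `pyannote/speaker-diarization-3.1` model. The helper names are hypothetical, and newer pyannote releases may need different arguments.

```python
SAMPLE_RATE = 16000  # whisper.load_audio resamples everything to 16 kHz mono


def sample_range(start_s, end_s, n_samples, sample_rate=SAMPLE_RATE):
    """Convert a turn given in seconds to sample indices clamped to the clip."""
    lo = max(0, int(start_s * sample_rate))
    hi = min(n_samples, int(end_s * sample_rate))
    return lo, max(lo, hi)


def diarize(path, hf_token):
    """Return (start, end, speaker) turns. Assumes pyannote.audio 3.x."""
    from pyannote.audio import Pipeline  # imported lazily: heavy dependency

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    annotation = pipeline(path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]


def transcribe_turns(path, turns, model_name="base"):
    """Transcribe each diarized turn with Whisper; returns (speaker, text) pairs."""
    import whisper  # the open-source openai-whisper package

    model = whisper.load_model(model_name)
    audio = whisper.load_audio(path)  # float32 NumPy array at SAMPLE_RATE
    lines = []
    for start, end, speaker in turns:
        lo, hi = sample_range(start, end, len(audio))
        if hi - lo < SAMPLE_RATE // 10:  # skip slivers shorter than 0.1 s
            continue
        text = model.transcribe(audio[lo:hi])["text"].strip()
        if text:
            lines.append((speaker, text))
    return lines
```

Something like `transcribe_turns("call.wav", diarize("call.wav", token))` would then give labeled lines. The labels come back as generic IDs such as `SPEAKER_00`, so mapping them to agent and customer is a separate step.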

Unfortunately, you might still get mistakes with this approach, because pyannote uses a model to infer who said what and it’s not always accurate. I’d look for an API that captures a separate audio stream per speaker, so that speaker attribution comes from the channel itself rather than from a model’s guess; that will be a faster way of solving this problem.