I like how speech-transcribing apps like fireflies.ai have the ability to distinguish between multiple speakers in the transcript. For example: speaker 1 said this, speaker 2 said that. I wonder if Whisper can do the same.
I have tried dumping an unstructured dialogue between two people into Whisper, passing the transcript to GPT for summarization, and asking questions like what one speaker said and what the other speaker said in response. Surprisingly, it's able to infer from the text alone that there are two speakers and list the things each one said. But I don't think it can be entirely accurate, nor can it format the output like
Speaker 1: …
Speaker 2: …
Speaker 1: ….
as a complete transcript. Maybe it can; I haven't tried it.
My suspicion is that fireflies, which is able to do the above, analyzes the sound of each person's voice to determine who spoke what.
What do you think? What is the proper way to achieve this?
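For context on what I've found so far: Whisper itself only transcribes with timestamps and has no notion of speaker identity, so the usual approach seems to be running a separate speaker-diarization model (pyannote.audio is a commonly mentioned one) that outputs "who spoke when" as timestamped turns, then aligning those turns with Whisper's timestamped segments. Below is a minimal sketch of just the alignment step; the segment/turn dicts are made-up sample data standing in for real model outputs, and `assign_speakers` is a hypothetical helper name, not part of any library.

```python
# Sketch of merging Whisper-style segments with diarization-style turns.
# Assumption: both models give (start, end) times in seconds; each segment
# is assigned to the speaker whose turn overlaps it the most.

def assign_speakers(segments, turns):
    """segments: [{'start','end','text'}], turns: [{'start','end','speaker'}]."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            # Length of the time interval shared by the segment and the turn.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, turn["speaker"]
        labeled.append(f'{best_speaker}: {seg["text"]}')
    return labeled

# Made-up sample data imitating the two models' outputs.
segments = [
    {"start": 0.0, "end": 3.5, "text": "Hi, thanks for joining."},
    {"start": 3.6, "end": 6.0, "text": "Happy to be here."},
]
turns = [
    {"start": 0.0, "end": 3.5, "speaker": "Speaker 1"},
    {"start": 3.5, "end": 6.2, "speaker": "Speaker 2"},
]
print("\n".join(assign_speakers(segments, turns)))
# → Speaker 1: Hi, thanks for joining.
# → Speaker 2: Happy to be here.
```

In practice the tricky part is overlapping speech and diarization boundary errors, which is presumably where fireflies puts its engineering effort.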