Can Whisper distinguish two speakers?

I like how some speech transcription apps can distinguish between multiple speakers in the transcript. For example: speaker 1 said this, speaker 2 said that. I wonder if Whisper can do the same.

I tried feeding an unstructured dialog between two people to Whisper, then passing the transcript to GPT for summarization and asking it what each speaker said. Surprisingly, it could tell that there were two speakers and list what one of them said, based on the text alone. But I don’t think it can be entirely accurate, nor can it format the result like

Speaker 1: …
Speaker 2: …
Speaker 1: …

as a complete transcript. Maybe it can; I haven’t tried it.

My suspicion is that Fireflies, which can do the above, analyzes the sound of each person’s voice to determine who said what.

What do you think? What is the proper way to achieve this?

This is not a feature of Whisper. Other systems can do it (the task is usually called speaker diarization), but while they are typically good at spotting who is speaking and when, they are not nearly as good as Whisper at determining what was said. A popular method is to combine the two: use timestamps to sync Whisper’s accurate word detection with the other system’s ability to detect who said it and when.
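
To make the timestamp-syncing idea concrete, here is a rough sketch of the merge step only. It assumes you already have two lists: timestamped text segments (the kind of thing Whisper's timestamped output gives you) and speaker turns from a separate diarization tool. The names `Segment`, `Turn`, `assign_speakers`, and `format_transcript` are made up for illustration, and the sample data is invented. Each text segment simply gets the speaker whose turn overlaps it the most in time; real pipelines may do this per word instead of per segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A piece of transcribed text with start/end times in seconds (hypothetical shape)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A span of time attributed to one speaker by a diarization tool (hypothetical shape)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


def format_transcript(labeled):
    """Render 'Speaker: text' lines, merging consecutive segments by the same speaker."""
    merged = []
    for speaker, text in labeled:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return "\n".join(f"{speaker}: {text}" for speaker, text in merged)


# Invented sample data standing in for the two systems' outputs.
example_segments = [
    Segment(0.0, 2.0, "Hello there."),
    Segment(2.1, 4.0, "Hi, how are you?"),
    Segment(4.2, 6.0, "Fine, thanks."),
]
example_turns = [
    Turn(0.0, 2.05, "Speaker 1"),
    Turn(2.05, 4.1, "Speaker 2"),
    Turn(4.1, 6.5, "Speaker 1"),
]

print(format_transcript(assign_speakers(example_segments, example_turns)))
```

Segments that overlap no turn at all are labeled "Unknown" rather than guessed. Overlapping speech and very short interjections are where this simple maximum-overlap rule is most likely to mislabel things.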