Can Whisper distinguish two speakers?

I like how speech-transcription apps like fireflies.ai have the ability to distinguish between multiple speakers in the transcript. For example: speaker 1 said this, speaker 2 said that. I wonder if Whisper can do the same.

I have tried dumping an unstructured dialog between two people into Whisper, then passing the transcript to GPT for summarization and asking questions like "what did one speaker say, and what did the other speaker say?" Surprisingly, based on the text alone, it was able to work out that there are two speakers and list the things each one said. But I don't think it can be entirely accurate, nor can it format the output as

Speaker 1: …
Speaker 2: …
Speaker 1: ….

as a complete transcript. Maybe it can; I haven't tried it.

My suspicion is that fireflies, which can do the above, analyzes the sound of each person's voice to determine who spoke which words.

What do you think is the proper way to achieve this?


This is not a feature of Whisper. There are other systems (speaker diarization models) that can do this; they are typically good at spotting who is speaking and when, but not nearly as good as Whisper at determining what was said. A popular method is to combine the two, using timestamps to sync Whisper's accurate word detection with the other system's ability to detect who said it and when.
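To illustrate the combining step, here is a minimal sketch of the timestamp-sync idea: each transcript segment gets assigned the diarization speaker whose turn overlaps it the most. The data structures and labels below are illustrative stand-ins, not the actual output format of Whisper or any particular diarization library.

```python
# Hypothetical sketch: merge transcript segments (with timestamps, as
# Whisper can produce) with speaker turns from a separate diarization
# system. Both inputs here are made-up example data.

def assign_speakers(segments, turns):
    """For each (start, end, text) segment, pick the speaker whose
    (start, end, label) turn overlaps the segment the most."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled

# Transcript-style segments: (start_sec, end_sec, text)
segments = [(0.0, 2.5, "Hi, how are you?"),
            (2.6, 5.0, "Doing well, thanks."),
            (5.1, 7.0, "Glad to hear it.")]

# Diarization-style turns: (start_sec, end_sec, speaker_label)
turns = [(0.0, 2.5, "SPEAKER_1"),
         (2.5, 5.0, "SPEAKER_2"),
         (5.0, 7.2, "SPEAKER_1")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Real pipelines do the same thing with finer word-level timestamps, but max-overlap matching per segment is the core of the trick.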

I have not been able to distinguish between speakers using the prompt. Can you share more details on how you achieved this?

AssemblyAI has a dead-simple, great model for this. Not affiliated, I've just found it really useful.

import assemblyai as aai

# Replace with your API token
aai.settings.api_key = "YOUR_API_TOKEN"

# URL of the file to transcribe
FILE_URL = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"

# You can also transcribe a local file by passing in a file path
# FILE_URL = './path/to/file.mp3'

config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
  FILE_URL,
  config=config
)

for utterance in transcript.utterances:
  print(f"Speaker {utterance.speaker}: {utterance.text}")


Is there a limit to the number of speakers using this method?

Likewise, I used AssemblyAI and found it super simple and accurate.