Can Whisper distinguish two speakers?

I like how speech transcription apps like fireflies.ai have the ability to distinguish between multiple speakers in the transcript, for example, Speaker 1 said this, Speaker 2 said that. I wonder if Whisper can do the same.

I have tried dumping an unstructured dialogue between two people into Whisper, passing the transcript to GPT for summarization, and then asking it questions like what one speaker said and what the other speaker said. Surprisingly, based on the text alone, it's able to work out that there are two speakers and list the things each one said. But I don't think it can be entirely accurate, nor can it format the output into something like

Speaker 1: …
Speaker 2: …
Speaker 1: ….

As a complete transcript. Maybe it can; I haven't tried it.

My suspicion is that Fireflies, which is able to do the above, analyzes the sound of each person's voice to determine who spoke what.

What do you think, what is the proper way to achieve this?

This is not a feature of Whisper. There are other systems that can do this, but they are typically good at spotting who is saying what and when, and not nearly as good as Whisper at determining what was said. A popular method is to combine the two and use timestamps to sync up Whisper's accurate word detection with the other system's ability to detect who said it and when.
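
Roughly, the sync step looks something like the sketch below. It assumes you already have Whisper segments (start, end, text, in seconds) and speaker turns from the diarization system (speaker, start, end); the dictionary shapes and the overlap helper are just for illustration, not any particular library's API.

def overlap(a_start, a_end, b_start, b_end):
    # Length in seconds of the overlap between two time ranges (0 if disjoint)
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(whisper_segments, diarization_turns):
    # whisper_segments:  [{'start': sec, 'end': sec, 'text': str}, ...]   (assumed shape)
    # diarization_turns: [{'speaker': str, 'start': sec, 'end': sec}, ...] (assumed shape)
    labeled = []
    for seg in whisper_segments:
        # Assign the segment to the speaker turn it overlaps the most
        best = max(
            diarization_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
            default=None,
        )
        speaker = best["speaker"] if best else "Unknown"
        labeled.append(f"{speaker}: {seg['text'].strip()}")
    return labeled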

I have not been able to distinguish between speakers using the prompt. Can you share more details on how you achieved this?

AssemblyAI has a dead-simple-to-use, great model for this. Not affiliated, I've just found it really useful.

import assemblyai as aai

# Replace with your API token
aai.settings.api_key = "YOUR_API_TOKEN"

# URL of the file to transcribe
FILE_URL = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"

# You can also transcribe a local file by passing in a file path
# FILE_URL = './path/to/file.mp3'

config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
  FILE_URL,
  config=config
)

for utterance in transcript.utterances:
  print(f"Speaker {utterance.speaker}: {utterance.text}")

Is there a limit to the number of speakers using this method?

Likewise, I used Assembly and found it super simple and accurate.

Assembly is a great start. It’ll distinguish between Speaker A and Speaker B and so on. If you want something more specific (i.e. the actual names of the speakers), then you can implement something like this…

1. Call an LLM API to look at the transcript from Assembly, identify each speaker by name, and return the result as a JSON object with keys ‘Speaker A’, ‘Speaker B’, ‘Speaker C’, etc., and their corresponding full names as values.
This function does the following:
a) Prepares a message for the LLM with instructions and the transcript.
b) Sends the message to the LLM API (Claude in this case).
c) Receives the response containing speaker identifications.

import json
import re

def identify_speakers(transcript_text):
    messages = [
        {
            "role": "user",
            "content": f"""Based on the following transcript, identify all speakers by their full names.
Return the result as a JSON object with keys 'Speaker A', 'Speaker B', 'Speaker C', etc., and their corresponding full names as values.
If you cannot confidently identify a speaker, use 'Unknown' as the value.
Include all speakers mentioned in the transcript, even if there are more than two.
Transcript:
{transcript_text}"""
        }
    ]
    response = client.messages.create(
        model="LLM of your choice",
        max_tokens=1024,  # set whatever limit you need
        messages=messages
    )

2. Then use Python to parse the JSON (this continues inside identify_speakers, right after the API call):
This:
a) Extracts the text content from the response.
b) Attempts to parse it as JSON.
c) If parsing fails, it tries to find a JSON-like structure in the content.
d) Returns the parsed JSON or an error message if parsing fails.

    # ...continuing inside identify_speakers(), right after the API call above
    content = response.content
    if isinstance(content, list) and len(content) == 1 and hasattr(content[0], 'text'):
        content = content[0].text
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Fall back to pulling out the first JSON-looking block in the text
        json_match = re.search(r'{.*}', content, re.DOTALL)
        if json_match:
            try:
                return json.loads(json_match.group())
            except json.JSONDecodeError:
                pass
        return {"error": "Could not parse content", "raw_content": str(content)}

3. Then define a function to use the parsed JSON to replace speaker labels with names and return an updated transcript
This process:
a) Defines a function to replace speaker labels with full names.
b) Combines the AssemblyAI transcript utterances into a single string.
c) Calls the identify_speakers function to get the speaker mapping.
d) Uses the replace_speaker_labels function to update the transcript with full names.

def replace_speaker_labels(transcript_text, speaker_mapping):
    for speaker_label, speaker_name in speaker_mapping.items():
        transcript_text = transcript_text.replace(speaker_label, speaker_name)
    return transcript_text

# Get the full transcript text
full_transcript = "\n".join([f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances])

# Identify speakers
speaker_mapping = identify_speakers(full_transcript)

# Replace speaker labels with names
updated_transcript = replace_speaker_labels(full_transcript, speaker_mapping)

Just a question,

How long would it take AssemblyAI STT to transcribe one hour?

A popular method is to combine the two and use timestamps to sync up Whisper's accurate word detection with the other system's ability to detect who said it and when.

I thought this seemed like an amazing idea, so I have tried to make it work. I have a JSON file created by Whisper, and another JSON file from Assembly AI. Now I am looking at the word timestamps in the files and…they do not match up.

It seems that Whisper can't do word-level timestamps itself and instead relies on an external tool that tracks something like the duration of each word, or the gap between words; in any case, it's measured in seconds. AssemblyAI, on the other hand, provides the actual timestamp in milliseconds for each word.

There does not appear to be an easy way to match these two up, or maybe I am missing something. Any tips or further thoughts on how to make this work? Help would be very much appreciated.
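
For what it's worth, here is roughly what I am attempting, in case someone can spot the flaw. The field names are just what I'm assuming my two JSON files contain; the only real trick is converting AssemblyAI's milliseconds to seconds before comparing.

def label_whisper_words(whisper_words, assembly_words, tolerance=0.5):
    # whisper_words:  [{'word': str, 'start': seconds}, ...]                 (assumed shape)
    # assembly_words: [{'text': str, 'start': ms, 'speaker': str}, ...]      (assumed shape)
    labeled = []
    for w in whisper_words:
        # Nearest AssemblyAI word by start time, after converting ms -> seconds
        nearest = min(
            assembly_words,
            key=lambda a: abs(a["start"] / 1000.0 - w["start"]),
            default=None,
        )
        close = nearest is not None and abs(nearest["start"] / 1000.0 - w["start"]) <= tolerance
        speaker = nearest["speaker"] if close else "Unknown"
        labeled.append((speaker, w["word"]))
    return labeled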

Why don’t you let Assembly return the timestamps, cut the audio into pieces and then let Whisper do the transcribing?
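
Something along these lines, roughly. This assumes pydub (with ffmpeg) for the slicing and the open-source whisper package for the transcription, and it reuses the AssemblyAI transcript object from earlier; the file paths are placeholders.

import whisper
from pydub import AudioSegment

audio = AudioSegment.from_file("meeting.mp3")    # placeholder path to the original audio
model = whisper.load_model("base")

lines = []
for utt in transcript.utterances:                # AssemblyAI utterances, start/end in ms
    clip = audio[utt.start:utt.end]              # pydub slices by milliseconds
    clip.export("clip.wav", format="wav")        # temporary file for Whisper
    result = model.transcribe("clip.wav")
    lines.append(f"Speaker {utt.speaker}: {result['text'].strip()}")

print("\n".join(lines))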