Can Whisper distinguish two speakers?

techjp · July 20, 2024, 7:40pm

A popular method is to combine the two and use timestamps to sync up the accurate Whisper word detection with the other system's ability to detect who said it and when.

I thought this seemed like an amazing idea, so I have tried to make it work. I have a JSON file created by Whisper, and another JSON file from Assembly AI. Now I am looking at the word timestamps in the files and…they do not match up.

It seems that Whisper can’t do timestamps itself and instead uses an external tool that tracks something like the length of time for each word, or the gap between words, something like that. It’s measured in seconds. Assembly AI on the other hand provides the actual timestamps in milliseconds for each word.

There does not appear to be an easy way to match these two up, or maybe I am missing something. Any tips or further thoughts on how to make this work? Help would be very much appreciated.
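
For anyone trying the same thing, here is a rough sketch of the kind of matching I have in mind; it is not a tested solution. It converts the Whisper times from seconds to milliseconds, then gives each Whisper word the speaker of the diarized word whose time span overlaps it the most. The field names (`word`, `text`, `start`, `end`, `speaker`) and the helper names are assumptions for illustration, so adjust them to whatever your JSON files actually contain.

```python
def to_ms(seconds):
    """Convert a time in seconds (assumed Whisper unit) to whole milliseconds."""
    return round(seconds * 1000)


def overlap_ms(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time spans, in ms (0 if they do not overlap)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(whisper_words, diarized_words):
    """Label each Whisper word with the speaker of the best-overlapping diarized word.

    whisper_words:  list of {"word", "start", "end"} with times in seconds (assumed).
    diarized_words: list of {"text", "start", "end", "speaker"} with times in ms (assumed).
    """
    labeled = []
    for w in whisper_words:
        start, end = to_ms(w["start"]), to_ms(w["end"])
        best_speaker, best_overlap = None, 0
        for d in diarized_words:
            ov = overlap_ms(start, end, d["start"], d["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = d["speaker"], ov
        if best_speaker is None and diarized_words:
            # No overlap at all: fall back to the diarized word with the nearest midpoint.
            mid = (start + end) / 2
            nearest = min(
                diarized_words,
                key=lambda d: abs((d["start"] + d["end"]) / 2 - mid),
            )
            best_speaker = nearest["speaker"]
        labeled.append(
            {
                "word": w["word"].strip(),
                "start_ms": start,
                "end_ms": end,
                "speaker": best_speaker,
            }
        )
    return labeled


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the inputs.
    whisper_words = [
        {"word": " Hello", "start": 0.0, "end": 0.42},
        {"word": " there", "start": 0.48, "end": 0.80},
        {"word": " Hi", "start": 1.30, "end": 1.55},
    ]
    diarized_words = [
        {"text": "Hello", "start": 20, "end": 400, "speaker": "A"},
        {"text": "there", "start": 450, "end": 790, "speaker": "A"},
        {"text": "Hi", "start": 1320, "end": 1600, "speaker": "B"},
    ]
    for item in assign_speakers(whisper_words, diarized_words):
        print(item["speaker"], item["word"], item["start_ms"], item["end_ms"])
```

Matching on time overlap rather than on exact timestamps or exact word text means the two files do not need to agree to the millisecond, or even on every word. The nested loop is fine for short recordings; for long ones it would probably need something smarter.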
