Can Whisper distinguish two speakers?

A popular method is to combine Whisper with a second system and use timestamps to sync Whisper's accurate word detection with the other system's ability to detect who said it and when.

I thought this seemed like an amazing idea, so I have tried to make it work. I have a JSON file created by Whisper, and another JSON file from Assembly AI. Now I am looking at the word timestamps in the files and…they do not match up.

It seems that Whisper can’t produce timestamps itself and instead relies on an external tool that tracks something like the duration of each word or the gap between words. Those values are measured in seconds. Assembly AI, on the other hand, provides the actual timestamps in milliseconds for each word.

There does not appear to be an easy way to match these two up, or maybe I am missing something. Any tips or further thoughts on how to make this work? Help would be very much appreciated.
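
One rough idea, as a minimal sketch rather than a working solution: convert the Whisper times from seconds to milliseconds so both files use the same unit, then give each Whisper word the speaker of the Assembly AI word it overlaps most in time. The field names here (`word`, `text`, `start`, `end`, `speaker`) and the sample values are assumptions made for illustration, and the real JSON files may be laid out differently.

```python
def to_ms(seconds):
    """Convert a time in seconds to whole milliseconds."""
    return int(round(seconds * 1000))


def overlap_ms(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in milliseconds."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(whisper_words, diarized_words):
    """Label each Whisper word with the speaker of the best-overlapping word.

    whisper_words: dicts with "word", "start", "end" (seconds, assumed layout).
    diarized_words: dicts with "start", "end" (milliseconds) and "speaker"
    (assumed layout).
    """
    labeled = []
    for w in whisper_words:
        start, end = to_ms(w["start"]), to_ms(w["end"])
        best_speaker, best_overlap = None, 0
        for d in diarized_words:
            ov = overlap_ms(start, end, d["start"], d["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = d["speaker"], ov
        labeled.append({
            "word": w["word"].strip(),
            "start_ms": start,
            "end_ms": end,
            "speaker": best_speaker,  # stays None if nothing overlaps
        })
    return labeled


# Made-up sample data, only to show the shape of the inputs.
whisper_words = [
    {"word": " Hello", "start": 0.00, "end": 0.42},
    {"word": " there", "start": 0.45, "end": 0.80},
    {"word": " Hi", "start": 1.10, "end": 1.30},
]
diarized_words = [
    {"text": "Hello", "start": 10, "end": 400, "speaker": "A"},
    {"text": "there", "start": 430, "end": 790, "speaker": "A"},
    {"text": "Hi", "start": 1120, "end": 1310, "speaker": "B"},
]

labeled = assign_speakers(whisper_words, diarized_words)
for item in labeled:
    print(item["speaker"], item["word"], item["start_ms"], item["end_ms"])
```

Matching on overlap instead of exact equality means the two sets of timestamps only need to be roughly aligned, which seems more realistic than expecting them to agree to the millisecond. The nested loop is fine for short recordings but would be slow on long ones.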
