Whisper: how to tag different speakers in an audio conversation

I am trying to get Whisper to tag a dialogue where more than one person is speaking. Any ideas for a prompt that would guide Whisper to “tag” who is speaking and format its answer accordingly? My Whisper prompt is currently as follows:
audio_file = open(sound_file, "rb")
prompt = "If more than one person, then use html line breaks to separate them in your answer"
transcript = get_whisper(sound_system, audio_file, prompt)

I would like to get an answer like:
person_1: …

and so on.

I get really impressive results from Whisper, but I struggle to get the answer “tagged” for readability. Currently the answer is “just” one block of text.

Here is a video I ran across a while ago, where they use Whisper (open source version) for the transcription and AWS Transcribe to detect the speakers. Note: the video also includes a GitHub link to the code.

They use the timestamps from both outputs to correlate the two. However, the Whisper API doesn’t support timestamps (as of now), whereas the open source version of Whisper does.
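
A rough sketch of that correlation step, in case it helps. This is not the code from the video: the dict shapes (`start`, `end`, `text`, `speaker`) and the sample data are my own assumptions, so you would first have to map the real Whisper and AWS Transcribe output into this form. Each transcript segment simply gets the speaker whose turn overlaps it the most.

```python
from typing import Dict, List


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the overlap between two time intervals (0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def tag_segments(transcript: List[Dict], turns: List[Dict]) -> List[str]:
    """Label each transcript segment with the speaker whose turn overlaps it most.

    Assumes `turns` is not empty.
    """
    lines = []
    for seg in transcript:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        lines.append(f'{best["speaker"]}: {seg["text"].strip()}')
    return lines


# Made-up sample data (times in seconds), only to show the idea.
transcript = [
    {"start": 0.0, "end": 2.4, "text": " Hello, how are you?"},
    {"start": 2.6, "end": 4.0, "text": " Fine, thanks."},
]
turns = [
    {"start": 0.0, "end": 2.5, "speaker": "person_1"},
    {"start": 2.5, "end": 4.2, "speaker": "person_2"},
]

tagged = tag_segments(transcript, turns)
print("\n".join(tagged))
```

Picking the turn with the largest overlap, rather than requiring an exact timestamp match, tolerates the small disagreements you will get between two independent passes over the same audio.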

If you prefer the Whisper API, which lacks timestamps, you could get the timestamps (and speakers) out of AWS Transcribe, slice the audio file into pieces that correspond to each segment/speaker, and then send each piece to the Whisper API for final transcription. This would be the easiest approach if you only want to use the Whisper API. Otherwise, use their approach with the open source version.
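
Here is a minimal sketch of the slicing part, assuming the audio is an uncompressed WAV and the speaker turns are already available as (speaker, start, end) tuples in seconds. It uses only Python's standard `wave` module; `slice_wav` and `make_test_wav` are names I made up, and the upload of each clip to the Whisper API is deliberately left out.

```python
import io
import wave
from typing import List, Tuple


def slice_wav(wav_bytes: bytes, turns: List[Tuple[str, float, float]]) -> List[Tuple[str, bytes]]:
    """Cut one in-memory WAV clip per (speaker, start_sec, end_sec) turn."""
    clips = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for speaker, start, end in turns:
            # Jump to the first frame of the turn and read its frames.
            src.setpos(int(round(start * rate)))
            frames = src.readframes(int(round((end - start) * rate)))
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setparams(params)     # same channels, sample width, rate
                dst.writeframes(frames)   # header is patched to the real length
            clips.append((speaker, buf.getvalue()))
    return clips


def make_test_wav(seconds: float = 1.0, rate: int = 8000) -> bytes:
    """Build a silent mono 16-bit WAV in memory, only so the sketch runs."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


# Hypothetical speaker turns, as a diarization pass might report them.
turns = [("person_1", 0.0, 0.5), ("person_2", 0.5, 1.0)]
clips = slice_wav(make_test_wav(), turns)
for speaker, clip in clips:
    # Each clip would be sent to the Whisper API here and the returned
    # text prefixed with the speaker label.
    print(speaker, len(clip), "bytes")
```

Note that `setpos` raises `wave.Error` if a turn starts beyond the end of the file, so real timestamps may need clamping first.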

Personally, at first glance, I’d rather get the timestamps and speakers from the AWS pass, then slice and send to the Whisper API. This assumes the timestamps are accurate enough not to chop mid-word. Otherwise you might have to dig into the audio around each timestamp with something like pydub to find a more accurate cut point before you slice. You could also correlate words between the AWS pass and the Whisper pass to sync things up without diving into pydub (or FFTs or other ways of finding those transitions).
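
To illustrate the "find a better cut point" idea without pulling in pydub, here is a plain-Python sketch that looks for the quietest short window near a reported timestamp. It assumes the audio is already decoded into a list of integer samples; `snap_to_quiet`, the search radius, and the window size are my own guesses and would need tuning on real recordings.

```python
from typing import Sequence


def snap_to_quiet(samples: Sequence[int], rate: int, cut_sec: float,
                  search_sec: float = 0.25, window_sec: float = 0.02) -> float:
    """Return the time near cut_sec where a short window has the least energy."""
    window = max(1, int(window_sec * rate))
    lo = max(0, int((cut_sec - search_sec) * rate))
    hi = min(len(samples) - window, int((cut_sec + search_sec) * rate))
    best_start, best_energy = None, None
    # Step through the search range one window at a time.
    for start in range(lo, hi + 1, window):
        energy = sum(abs(s) for s in samples[start:start + window])
        if best_energy is None or energy < best_energy:
            best_start, best_energy = start, energy
    if best_start is None:
        return cut_sec  # search range was empty, keep the original timestamp
    # Cut in the middle of the quietest window.
    return (best_start + window / 2) / rate


# Synthetic signal at 1000 samples/s: loud, a 0.1 s pause at 1.1 s, loud again.
rate = 1000
samples = [1000] * 1100 + [0] * 100 + [1000] * 800

# The reported boundary (1.0 s) falls inside the loud part, i.e. mid-word.
snapped = snap_to_quiet(samples, rate, cut_sec=1.0)
print(f"reported cut: 1.0 s, snapped cut: {snapped:.3f} s")
```

Summed absolute amplitude is a crude stand-in for loudness; background noise or music would need something more robust.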


Thanks for this. Just checked the video and found the instructions OK and easy enough. Will test it later and post about the result.
