Whisper, how to tag different people in (sound) conversation

jtapiovaara · June 2, 2023, 6:25am

I am trying to get Whisper to tag a dialogue in which more than one person is speaking. Any ideas for a prompt that would guide Whisper to “tag” who is speaking and format its answer accordingly? My Whisper prompt is currently as follows:
audio_file = open(sound_file, "rb")
prompt = "If more than one person, then use html line breaks to separate them in your answer"
transcript = get_whisper(sound_system, audio_file, prompt)

And I would like to get an answer like this:
person_1: …
person_2: …
person_1: …
person_3: …

and so on.

I can get really amazing results from Whisper, but I struggle to get the answer “tagged” for readability. Currently the answer is “just” one block of text.

curt.kennedy · June 7, 2023, 9:56pm

Here is a video I ran across a while ago, where they use Whisper (open source version) for the transcription and AWS Transcribe to detect the speakers. Note: there is also a GitHub link to the code in the video.

They are using the timestamps from both streams to correlate the two. However, the Whisper API doesn’t support timestamps (as of now) whereas the Whisper open source version does.
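For illustration, here is a minimal sketch of that kind of timestamp correlation. It is not the code from the video: it assumes you already have Whisper segments and speaker turns as (start, end, label) spans in seconds, and it gives each segment to the speaker turn it overlaps most. The names `Span`, `tag_segments` and `format_dialogue` are made up for this example.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text (Whisper) or speaker name (diarization)


def overlap(a: Span, b: Span) -> float:
    """Length in seconds of the time shared by two spans (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def tag_segments(whisper_segments, speaker_turns):
    """Pair each transcript segment with the speaker turn it overlaps most."""
    tagged = []
    for seg in whisper_segments:
        best = max(speaker_turns, key=lambda turn: overlap(seg, turn), default=None)
        if best is not None and overlap(seg, best) > 0:
            speaker = best.label
        else:
            speaker = "unknown"  # no speaker turn covers this segment
        tagged.append((speaker, seg.label.strip()))
    return tagged


def format_dialogue(tagged):
    """Merge consecutive segments from the same speaker into one line each."""
    merged = []
    for speaker, text in tagged:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return "\n".join(f"{speaker}: {text}" for speaker, text in merged)
```

Feeding it three Whisper segments and two speaker turns would give one `person_1: ...` line followed by one `person_2: ...` line, which is the layout asked for above.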

Without the Whisper timestamps, if you prefer the Whisper API, you could get the timestamps (and speakers) out of AWS Transcribe, slice the audio file into pieces that correspond to each segment/speaker, and then send each piece to the Whisper API for the final transcription. This would be the easiest approach if you only want to use the Whisper API. Otherwise, use their approach with the open source version.
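A rough sketch of the slice-and-send idea, under some assumptions: `audio` is anything that can be sliced by milliseconds (a pydub `AudioSegment` behaves this way), `turns` is a list of (start seconds, end seconds, speaker) tuples that you have already extracted from the diarization output, and `transcribe` is a hypothetical callable of your own that sends one clip to the Whisper API and returns its text.

```python
def transcribe_by_speaker(audio, turns, transcribe):
    """Cut the audio at each speaker turn and transcribe the pieces one by one.

    audio      -- object sliceable by milliseconds (assumption: pydub-style)
    turns      -- iterable of (start_seconds, end_seconds, speaker)
    transcribe -- hypothetical callable: clip -> transcript text
    """
    lines = []
    for start_s, end_s, speaker in sorted(turns):
        # Convert seconds to milliseconds for the slice.
        clip = audio[int(start_s * 1000):int(end_s * 1000)]
        text = transcribe(clip).strip()
        if text:  # skip turns that come back empty
            lines.append(f"{speaker}: {text}")
    return "\n".join(lines)
```

With pydub you would typically load the file with `AudioSegment.from_file(...)` and export each clip to a file or buffer inside your own `transcribe` wrapper before uploading it.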

Personally, without even thinking too much, I’d rather get the timestamps and speakers from the AWS pass, then slice and send to the Whisper API. This assumes the timestamps are accurate enough not to chop mid-word. Otherwise you might have to dig into the audio around each timestamp with something like pydub to get a more accurate cut point before you slice. You could also correlate the words from the AWS pass with the words from the Whisper pass to sync things up without diving into pydub (or FFTs or other ways of finding those transitions).
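To make the word correlation idea concrete, here is one possible sketch using only the standard library. It assumes you have reduced the AWS output to a list of (word, speaker) pairs and the Whisper output to a plain list of words. It aligns the two word sequences with `difflib.SequenceMatcher` and copies the speaker across wherever the words match; words that do not match inherit the nearest earlier speaker. `transfer_speakers` is a made-up name, and this is just one way to do the alignment.

```python
from difflib import SequenceMatcher


def normalise(word):
    """Lowercase and drop punctuation so 'you?' and 'you' compare equal."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def transfer_speakers(aws_words, whisper_words):
    """Copy speaker labels from the AWS words onto the Whisper words.

    aws_words     -- list of (word, speaker) pairs
    whisper_words -- list of words from the Whisper transcript
    """
    a = [normalise(word) for word, _ in aws_words]
    b = [normalise(word) for word in whisper_words]
    speakers = [None] * len(b)

    # Matching blocks are runs of identical words found in both sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            speakers[block.b + k] = aws_words[block.a + k][1]

    # Unmatched words take the previous known speaker (or the first known one).
    last = next((s for s in speakers if s is not None), "unknown")
    for i, speaker in enumerate(speakers):
        if speaker is None:
            speakers[i] = last
        else:
            last = speaker
    return list(zip(whisper_words, speakers))
```

The speaker changes in the result then tell you where to break lines in the Whisper text, without touching the audio at all.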

jtapiovaara · June 8, 2023, 6:30am

Thanks for this. Just checked the video and found the instructions OK and easy enough. Will test it later and post about the result.
