Transcript: Amazon and Whisper merge?

I have many audio files of two people speaking that need to be transcribed. I tried Whisper from OpenAI, which works perfectly. Unfortunately, Whisper can’t distinguish between the two speakers.

Now, I have tried Amazon Transcribe. Amazon can distinguish speakers, but is much worse at transcribing than Whisper.

Is there any way I can “merge” the two .json files so that I take the speakers from Amazon and the text from Whisper?

Example from the Amazon file:

[{"confidence":"0.6407","content":"dazu"}],"type":"pronunciation"},
{"start_time":"1020.38","speaker_label":"spk_0","end_time":"1020.93","alternatives":[{"confidence":"1.0","content":"tells"}],"type":"pronunciation"},
{"speaker_label":"spk_0","alternatives":[{"confidence":"0.0","content":","}],"type":"punctuation"},
{"start_time":"1020.93","speaker_label":"spk_0","end_time":"1021.23","alternatives":[{"confidence":"0.5785","content":"dass"}],"type":"pronunciation"},
{"start_time":"1021.24","speaker_label":"spk_0","end_time":"1021.42","alternatives":[{"confidence":"0.5027","content":"das"}],"type":"pronunciation"},
{"start_time":"1021.42","speaker_label":"spk_0","end_time":"1021.64","alternatives":[{"confidence":"0.9825","content":"deine"}],"type":"pronunciation"},
{"start_time":"1021.64","speaker_label":"spk_0","end_time":"1021.91","alternatives":[{"confidence":"1.0","content":"mutter"}],"type":"pronunciation"},
{"start_time":"1021.91","speaker_label":"spk_0","end_time":"1022.22","alternatives":[{"confidence":"0.9509","content":"sagt"}],"type":"pronunciation"},

I did find someone on GitHub who somehow managed to do this, but unfortunately I don't know much about programming.

Maybe you guys have some idea how to implement this.

Thanks a lot!

You could presumably turn on time codes for both services and then use some fuzzy logic to match them up… It might work so long as there is not much drift.
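
A minimal sketch of that timestamp matching in Python, in case it helps. It gives each Whisper segment the Amazon speaker whose words overlap it longest in time. It assumes the Amazon items look like the ones in the question (start_time and end_time as strings, a speaker_label, punctuation items without times) and that the Whisper segments are dicts with start, end and text in seconds. The name label_segments and the tolerance padding are made up for illustration and are not part of either service.

```python
from collections import defaultdict


def label_segments(whisper_segments, amazon_items, tolerance=0.5):
    """Give each Whisper segment the speaker that overlaps it the longest."""
    # Keep only timed Amazon words; punctuation items carry no timestamps.
    words = [
        (float(it["start_time"]), float(it["end_time"]), it["speaker_label"])
        for it in amazon_items
        if it.get("type") == "pronunciation" and "speaker_label" in it
    ]
    labelled = []
    for seg in whisper_segments:
        # Widen the segment a little to absorb small timing drift (assumed value).
        lo, hi = seg["start"] - tolerance, seg["end"] + tolerance
        overlap = defaultdict(float)
        for start, end, speaker in words:
            shared = min(end, hi) - max(start, lo)
            if shared > 0:
                overlap[speaker] += shared
        # None means no Amazon word fell inside this segment's window.
        speaker = max(overlap, key=overlap.get) if overlap else None
        labelled.append({
            "speaker": speaker,
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        })
    return labelled


if __name__ == "__main__":
    # Tiny hand-made sample, not real service output.
    amazon_items = [
        {"start_time": "1020.38", "end_time": "1020.93",
         "speaker_label": "spk_0", "type": "pronunciation"},
        {"speaker_label": "spk_0", "type": "punctuation"},
        {"start_time": "1023.10", "end_time": "1023.60",
         "speaker_label": "spk_1", "type": "pronunciation"},
    ]
    whisper_segments = [
        {"start": 1020.3, "end": 1022.2, "text": " dass das deine Mutter sagt"},
        {"start": 1023.0, "end": 1024.0, "text": " Ja."},
    ]
    for row in label_segments(whisper_segments, amazon_items):
        print(row["speaker"], row["text"])
```

If the two services drift further apart over a long file, a larger tolerance may help, but it also raises the chance of picking up the neighbouring speaker.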

I have seen a post about this before. I think it was a YouTube video (not sure) where they discussed how they used Amazon to get the speakers and then used timestamps to compare and merge that with the Whisper transcription.
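
I don't have the video's code, so this is only a rough Python sketch of what "get the speakers from Amazon and compare timestamps" could look like. It first collapses Amazon's word-level items into speaker turns, then looks up which turn is closest to a given time (for example the midpoint of a Whisper segment). It assumes items shaped like the sample in the question. The helper names speaker_turns and speaker_at are hypothetical.

```python
def speaker_turns(amazon_items):
    """Collapse word-level Amazon items into consecutive speaker turns."""
    turns = []
    for it in amazon_items:
        if it.get("type") != "pronunciation":
            continue  # punctuation items have no timestamps
        speaker = it["speaker_label"]
        start, end = float(it["start_time"]), float(it["end_time"])
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["end"] = end  # same speaker keeps talking
        else:
            turns.append({"speaker": speaker, "start": start, "end": end})
    return turns


def speaker_at(turns, t):
    """Return the speaker of the turn closest to time t (None if no turns)."""
    def distance(turn):
        if turn["start"] <= t <= turn["end"]:
            return 0.0
        return min(abs(t - turn["start"]), abs(t - turn["end"]))

    return min(turns, key=distance)["speaker"] if turns else None


if __name__ == "__main__":
    # Tiny hand-made sample, not real service output.
    items = [
        {"start_time": "1020.38", "end_time": "1020.93",
         "speaker_label": "spk_0", "type": "pronunciation"},
        {"speaker_label": "spk_0", "type": "punctuation"},
        {"start_time": "1020.93", "end_time": "1021.23",
         "speaker_label": "spk_0", "type": "pronunciation"},
        {"start_time": "1023.10", "end_time": "1023.60",
         "speaker_label": "spk_1", "type": "pronunciation"},
    ]
    turns = speaker_turns(items)
    segment = {"start": 1023.0, "end": 1024.0, "text": "Ja."}
    midpoint = (segment["start"] + segment["end"]) / 2
    print(speaker_at(turns, midpoint), segment["text"])
```

Taking the nearest turn rather than requiring an exact hit is a deliberate choice here, since the two services rarely agree on timestamps to the hundredth of a second.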
