I have many audio files, each with two people speaking, that need to be transcribed. I tried OpenAI's Whisper, and the transcription quality is excellent. Unfortunately, Whisper can't distinguish between the two speakers.
I then tried Amazon Transcribe. It can distinguish the speakers, but its transcription is much worse than Whisper's.
Is there any way to “merge” the two .json files so that I take the speaker labels from Amazon and the text from Whisper?
Example from the Amazon file:
[{"confidence":"0.6407","content":"dazu"}],"type":"pronunciation"},{"start_time":"1020.38","speaker_label":"spk_0","end_time":"1020.93","alternatives":[{"confidence":"1.0","content":"tells"}],"type":"pronunciation"},{"speaker_label":"spk_0","alternatives":[{"confidence":"0.0","content":","}],"type":"punctuation"},{"start_time":"1020.93","speaker_label":"spk_0","end_time":"1021.23","alternatives":[{"confidence":"0.5785","content":"dass"}],"type":"pronunciation"},{"start_time":"1021.24","speaker_label":"spk_0","end_time":"1021.42","alternatives":[{"confidence":"0.5027","content":"das"}],"type":"pronunciation"},{"start_time":"1021.42","speaker_label":"spk_0","end_time":"1021.64","alternatives":[{"confidence":"0.9825","content":"deine"}],"type":"pronunciation"},{"start_time":"1021.64","speaker_label":"spk_0","end_time":"1021.91","alternatives":[{"confidence":"1.0","content":"mutter"}],"type":"pronunciation"},{"start_time":"1021.91","speaker_label":"spk_0","end_time":"1022.22","alternatives":[{"confidence":"0.9509","content":"sagt"}],"type":"pronunciation"},
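For anyone answering, here is a rough sketch of the kind of merge meant here, not a tested solution. It assumes the Whisper .json has a "segments" list whose entries carry "start", "end" and "text" (as in the JSON written by the whisper command-line tool), and that the Amazon items look like the excerpt above, with a "speaker_label" on every item. The function name assign_speakers is made up for illustration. Each Whisper segment simply gets the Amazon speaker whose words overlap it the longest:

```python
from collections import defaultdict


def assign_speakers(whisper_segments, amazon_items):
    """Label each Whisper segment with the Amazon speaker that overlaps it most.

    Hypothetical helper: the input shapes are assumptions, see the text above.
    """
    # Keep only timed words; punctuation items have no start/end time.
    words = [
        (float(item["start_time"]), float(item["end_time"]), item["speaker_label"])
        for item in amazon_items
        if item.get("type") == "pronunciation" and "speaker_label" in item
    ]

    merged = []
    for seg in whisper_segments:
        # Seconds of overlap between this segment and each speaker's words.
        overlap = defaultdict(float)
        for start, end, speaker in words:
            shared = min(end, seg["end"]) - max(start, seg["start"])
            if shared > 0:
                overlap[speaker] += shared
        # None means no Amazon word falls inside this segment.
        speaker = max(overlap, key=overlap.get) if overlap else None
        merged.append({
            "speaker": speaker,
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        })
    return merged


if __name__ == "__main__":
    # Two items taken from the excerpt above; the Whisper segment is invented.
    items = [
        {"start_time": "1021.64", "speaker_label": "spk_0", "end_time": "1021.91",
         "alternatives": [{"confidence": "1.0", "content": "mutter"}],
         "type": "pronunciation"},
        {"start_time": "1021.91", "speaker_label": "spk_0", "end_time": "1022.22",
         "alternatives": [{"confidence": "0.9509", "content": "sagt"}],
         "type": "pronunciation"},
    ]
    segments = [{"start": 1020.0, "end": 1023.0, "text": " dass das deine Mutter sagt"}]
    for row in assign_speakers(segments, items):
        print(row["speaker"], row["text"])
```

With real files, whisper_segments would be the "segments" list loaded from the Whisper .json via json.load, and amazon_items the list of items from the Amazon .json. A segment in which both people talk still gets only one label, so splitting at the word level would need a finer approach than this.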
I did find someone on GitHub who managed to do this, but unfortunately I don't know much about programming.
Maybe you guys have some idea how to implement this.
Thanks a lot!