Best solution for Whisper diarization/speaker labeling?

Wondering what the state of the art is for diarization using Whisper, or if OpenAI has revealed any plans for native implementations in the pipeline. I’ve found some that can run locally, but ideally I’d still be able to use the API for speed and convenience.

Google Cloud Speech-to-Text has built-in diarization, but I’d rather keep my tech stack all OpenAI if I can, and believe Whisper is better regardless.


Yes, this is a shortcoming. I have tried using Whisper in combination with pyannote, but the combination is a bit complicated to implement, and the results are not even close to ideal.
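For anyone attempting the Whisper + pyannote route, the fiddly part is aligning the two outputs: Whisper gives you timestamped text segments, pyannote gives you timestamped speaker turns, and you have to match them up. Here is a minimal sketch of that alignment step in pure Python; the function name and data shapes are my own illustration, not either library's API (assume you've already extracted Whisper segments as dicts and pyannote turns as start/end/speaker tuples):

```python
def assign_speakers(whisper_segments, speaker_turns):
    """Label each Whisper segment with the speaker whose diarization
    turn overlaps it the most. Segments with no overlap get 'unknown'."""
    labeled = []
    for seg in whisper_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for start, end, speaker in speaker_turns:
            # length of the intersection of [seg start, seg end] and [start, end]
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Example with hand-made data (timestamps in seconds):
segments = [
    {"start": 0.0, "end": 4.0, "text": "Hi, thanks for joining."},
    {"start": 4.2, "end": 7.5, "text": "Happy to be here."},
]
turns = [(0.0, 4.1, "SPEAKER_00"), (4.1, 8.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

Note this simple max-overlap rule is exactly where the "not close to ideal" results come from: when two people talk over each other, a segment gets only one label.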


This library has picked up a bunch of steam. I haven't used it yet, but everything I've read and seen looks pretty amazing. It runs locally without the API, yet seems to be especially fast.

This still cannot differentiate between speakers.


Seems like it does.

Hi there, so finally, did you find the best solution for diarization?

I am looking for a fast inference model that can diarize faster than Pyannote.

well… not so ‘short’ a coming, as we can see (judging by the date it was posted and where we are today) :sweat_smile:


So, all diarization can be sketchy; it's one of those areas where the shiny marketing material definitely does not match reality.

However, what I have noticed gives unusually good results is gpt-4o, when instructed to make a best guess as to different speakers based on conversation style. People use very different words and phrases, and gpt-4o has given some really good results with a prompt like this:

"there are three speakers in this transcription, make a best guess, based on speaking styles, at diarising it: … "

It works even better if you give the names of the people and indicate what one of their views in the conversation is (giving the model something to get a foothold on).


Do you have access to gpt-4o with audio input? How about audio output? Can you tell us more about your experience using it? We mere mortals can only use text and image input and text output for now. I'm interested in how audio input and audio output work.

I'm doing the same thing as you, nothing above mortal level :slight_smile:

I'm using Whisper to create a general, non-diarised transcript, then I'm feeding that transcript into gpt-4o with the prompt and hints, and it does really well at guessing the different people.

I then present that to the user in a UI for final checks and changes
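For reference, that two-step pipeline can be sketched like this. The prompt builder is my own illustration of the "names plus a viewpoint hint" idea; the exact wording, file name, and helper name are assumptions, not a tested recipe:

```python
def build_diarisation_prompt(transcript, speakers, viewpoint_hint=""):
    """Compose the kind of hint-rich prompt described above: speaker count,
    names, and optionally one speaker's viewpoint as a foothold."""
    prompt = (
        f"There are {len(speakers)} speakers in this transcription: "
        f"{', '.join(speakers)}. Make a best guess, based on speaking "
        "styles, at diarising it. Prefix each turn with the speaker's name."
    )
    if viewpoint_hint:
        prompt += f" For context: {viewpoint_hint}."
    return prompt + "\n\nTranscript:\n" + transcript

# Usage with the OpenAI Python SDK (assumes `pip install openai` and an
# OPENAI_API_KEY in the environment):
#
#   from openai import OpenAI
#   client = OpenAI()
#   with open("meeting.mp3", "rb") as f:
#       text = client.audio.transcriptions.create(model="whisper-1", file=f).text
#   prompt = build_diarisation_prompt(
#       text, ["Alice", "Bob"], "Alice is arguing for the proposal")
#   reply = client.chat.completions.create(
#       model="gpt-4o", messages=[{"role": "user", "content": prompt}])
#   print(reply.choices[0].message.content)
```

The model's guess then goes to the user for final checks, as described above.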


Could you explain in more detail how to ask GPT-4o to transcribe audio and identify speakers? I have successfully transcribed my audio into text, but when I pass it to GPT, it attempts to use Python code to complete the task, which is unsuccessful.
Also, I've been using an Assistant with GPT-4o and JSON output, but it won't split the full transcription into multiple speakers.

Interesting approach, but I am not sure it can be more precise than Pyannote. Have you tried both and compared?

Use the Microsoft Speech API and pipe that response into the API.

This is the way to do it, for anyone wondering. It's the legitimate way to transcribe long-form audio efficiently.

Keep in mind the Batch API is 18 cents an hour, and real-time is $1 an hour, so think about the costs to your app or business.

```python
import azure.cognitiveservices.speech as speechsdk

# speech_config carries your Azure Speech key and region
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")
conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config)
```

It does a really good job at diarizing, but the recognition isn't as good as with Whisper.
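For the "pipe that response into the API" step mentioned above, a small helper like this can collapse the per-utterance results (if I recall correctly, the transcriber's `transcribed` callback exposes a speaker id and text on each result; the pair format and function name here are my own assumption) into speaker-labelled turns you can hand to a chat model:

```python
def merge_turns(utterances):
    """Collapse consecutive utterances from the same speaker into one turn,
    producing 'Speaker: text' lines ready to paste into a chat prompt."""
    turns = []
    for speaker, text in utterances:
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + text  # same speaker keeps talking
        else:
            turns.append([speaker, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)

# Example with the kind of (speaker_id, text) pairs you'd collect
# from the transcription callbacks:
utterances = [
    ("Guest-1", "Hello everyone."),
    ("Guest-1", "Shall we start?"),
    ("Guest-2", "Yes, let's begin."),
]
print(merge_turns(utterances))
```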

Here is a way to do it:

Please note, I am on the Falcon development team.

Interesting… If I understand correctly, Falcon appears to rely on Whisper's segmentation and tries to assign the most likely speaker to each segment. If that is the case, doesn't it depend on how accurate the segmentation is in the first place, since most segmentation is based on silence gaps between utterances? What happens to partially overlapped speech?