Hey everyone,
I’m trying the new gpt-4o-transcribe-diarize model and running into the cross-chunk speaker identity problem with large audio files (with no known_speaker_references available at the start).
The Problem:
When processing audio files larger than 25 MB, I need to split them into multiple requests (according to the documentation). Each chunk gets diarized independently with speaker labels like A, B, etc. However, these labels have no relationship across chunks:
- Chunk 1 (0-10 MB): A = Person A, B = Person B
- Chunk 2 (10-25 MB): A = Person B, B = Person A (labels flipped!)
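For reference, the chunking itself is nothing fancy; it looks roughly like the sketch below (I split by duration with pydub so each exported piece stays under the 25 MB upload limit; the chunk length and export format are just examples and depend on your bitrate):

```python
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()
CHUNK_MS = 10 * 60 * 1000  # example: 10-minute pieces; adjust so each export stays under 25 MB

def transcribe_in_chunks(path):
    audio = AudioSegment.from_file(path)
    results = []
    for offset_ms in range(0, len(audio), CHUNK_MS):
        piece = audio[offset_ms:offset_ms + CHUNK_MS]
        piece.export("chunk.mp3", format="mp3")
        with open("chunk.mp3", "rb") as f:
            # Each request is diarized independently, so the returned speaker
            # labels (A, B, ...) only have meaning inside this one chunk.
            results.append(
                client.audio.transcriptions.create(
                    model="gpt-4o-transcribe-diarize",
                    file=f,
                )
            )
    return results
```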
My Questions:
- What’s the recommended approach for maintaining speaker continuity? Should I:
- Use overlapping chunks and match speakers in the overlap (roughly the sketch after this list)?
- Extract audio based on the timestamps from chunk 1’s response, and add them as “known_speaker_references”?
- Use a different chunking strategy altogether?
- Has anyone successfully implemented this? What’s your workflow?
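To make option 1 concrete, here is roughly what I imagine the overlap matching would look like (a minimal sketch; the segment dicts with speaker/start/end fields are just my assumption about what the diarized response returns, with times converted to absolute seconds in the full recording):

```python
from collections import Counter

def match_speakers(prev_segments, next_segments, overlap_start, overlap_end):
    """
    Map the next chunk's speaker labels onto the previous chunk's labels
    by voting over segments that fall inside the shared overlap window.

    Segments are assumed to look like {"speaker": "A", "start": 12.3, "end": 15.0},
    with times in seconds relative to the full recording.
    """
    votes = Counter()
    for p in prev_segments:
        for n in next_segments:
            # How much do the two segments coincide inside the overlap window?
            lo = max(p["start"], n["start"], overlap_start)
            hi = min(p["end"], n["end"], overlap_end)
            if hi > lo:
                votes[(n["speaker"], p["speaker"])] += hi - lo

    # Greedily assign each next-chunk label to the previous-chunk label
    # it overlapped with the most.
    mapping, used = {}, set()
    for (next_label, prev_label), _ in votes.most_common():
        if next_label not in mapping and prev_label not in used:
            mapping[next_label] = prev_label
            used.add(prev_label)
    return mapping

# Example: if chunk 2's "A" mostly coincides with chunk 1's "B" during the
# shared 30 seconds, the mapping comes back as {"A": "B", "B": "A"}.
```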
Thank you very much!
Hi Martin,
I am facing the same task: I have long recordings that I need to diarize. As you noted, the call accepts an extra_body parameter that can pass known speakers to the model. The requirement is that each reference be between 2 and 10 seconds long, in any input format supported by the main audio upload. After the first chunk has been processed, it should be possible to:
- identify each speaker’s segments from the per-segment timestamps returned for the chunk, which tell you the time windows where a clean voice snapshot can be taken, and
- carve one of those windows out of the original recording and pass it as a labeled reference to the next model call, so the model can match the speaker consistently.
I am about to implement that, roughly along the lines of the sketch below, and can share the results if you are interested.
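This is untested; the extra_body payload shape for known_speaker_references (a name plus base64-encoded audio) is my assumption based on the docs and this thread, and I’m assuming the first response exposes per-segment speaker/start/end timestamps as discussed above:

```python
import base64
from openai import OpenAI
from pydub import AudioSegment  # used to slice the original recording

client = OpenAI()

def carve_reference(recording_path, start_s, end_s, out_path="ref.wav"):
    """Cut a 2-10 second snippet of a single speaker out of the full recording."""
    audio = AudioSegment.from_file(recording_path)
    snippet = audio[int(start_s * 1000):int(end_s * 1000)]  # pydub indexes in ms
    snippet.export(out_path, format="wav")
    return out_path

def transcribe_chunk(chunk_path, known_speakers=None):
    """
    Transcribe/diarize one chunk. `known_speakers` maps a display name to the
    path of a short reference clip. NOTE: the exact structure expected inside
    extra_body for known_speaker_references is an assumption here.
    """
    extra = None
    if known_speakers:
        extra = {
            "known_speaker_references": [
                {
                    "name": name,
                    "audio": base64.b64encode(open(path, "rb").read()).decode(),
                }
                for name, path in known_speakers.items()
            ]
        }
    with open(chunk_path, "rb") as f:
        return client.audio.transcriptions.create(
            model="gpt-4o-transcribe-diarize",
            file=f,
            extra_body=extra,
        )

# Workflow: diarize chunk 1 without references, pick one clean 2-10 s segment
# per speaker from its timestamps, carve those snippets out of the original
# file, then pass them as known speaker references when sending chunk 2.
```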
It looks like gpt-4o-transcribe-diarize adds context-based speaker name resolution on top of a Whisper result for the transcription and pyannote-style speaker embedding clustering.
For long videos or low-latency live streaming where chunking is applied, speaker diarization only works intra-chunk. For inter-chunk speaker tracking, I eventually had to fine-tune my own model and run it at the edge on a desktop.
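If you don’t want to go as far as fine-tuning, the inter-chunk tracking idea can be prototyped with an off-the-shelf speaker embedding model such as pyannote/embedding. The sketch below (not my fine-tuned setup; the model name, token placeholder, and 0.5 distance threshold are just examples) matches a new speaker clip against known speakers by cosine distance:

```python
import numpy as np
from pyannote.audio import Inference, Model
from scipy.spatial.distance import cdist

# Off-the-shelf speaker embedding model (requires a Hugging Face access token).
model = Model.from_pretrained("pyannote/embedding", use_auth_token="hf_...")
inference = Inference(model, window="whole")  # one embedding per clip

def embed(clip_path):
    """Return a (1 x D) speaker embedding for a short audio clip."""
    return np.atleast_2d(np.asarray(inference(clip_path)))

def match_to_known(clip_path, known, threshold=0.5):
    """
    Compare a new clip against known speakers by cosine distance.
    `known` maps a speaker name to an embedding from embed(); returns the
    closest name, or None if nothing is within the (example) threshold.
    """
    emb = embed(clip_path)
    best_name, best_dist = None, threshold
    for name, ref in known.items():
        dist = cdist(emb, ref, metric="cosine")[0, 0]
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```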
Hello Blade!
Yes, that’s exactly how I imagined it might work, but I haven’t been able to find any documentation describing such a workflow.
I’d be very interested to see your results!