Hey everyone,
I’m trying out the new gpt-4o-transcribe-diarize model and running into the cross-chunk speaker identity problem with large audio files (starting with no known_speaker_references).
The Problem:
When processing audio files >25MB, I need to split them into multiple requests (according to the documentation). Each chunk gets diarized independently with speaker labels like A, B, etc. However, these labels have no relationship across chunks:
- Chunk 1 (0-10MB): A = Person A, B = Person B
- Chunk 2 (10-25MB): A = Person B, B = Person A (labels flipped!)
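For context, here’s roughly what my chunking loop looks like right now. Just a sketch: I’m splitting by duration with pydub, and the response_format value is my reading of the docs, so correct me if that part is wrong.

```python
from pathlib import Path

from openai import OpenAI
from pydub import AudioSegment  # splitting by duration to stay under 25MB

client = OpenAI()
CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks; well under 25MB at my bitrate

audio = AudioSegment.from_file("meeting.m4a")
chunk_results = []
for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk_path = Path(f"chunk_{i}.mp3")
    audio[start:start + CHUNK_MS].export(chunk_path, format="mp3")
    with chunk_path.open("rb") as f:
        resp = client.audio.transcriptions.create(
            model="gpt-4o-transcribe-diarize",
            file=f,
            response_format="diarized_json",  # my guess at the diarized format name
        )
    chunk_results.append(resp)  # speaker labels here are per-chunk only
```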
My Questions:
- What’s the recommended approach for maintaining speaker continuity? Should I:
- Use overlapping chunks and match speakers in the overlap? (see the first sketch after this list)
- Extract audio for each speaker based on the timestamps in chunk 1’s response, and pass those clips as known_speaker_references for later chunks? (second sketch below)
- Use a different chunking strategy altogether?
- Has anyone successfully implemented this? What’s your workflow?
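For option 1, this is the overlap-matching idea I have in mind. It assumes each chunk’s response exposes segments as dicts with speaker/start/end in seconds relative to that chunk, which is my guess at the diarized response shape; rename the fields to whatever the API actually returns.

```python
from collections import defaultdict

def match_speaker_labels(prev_segs, next_segs, prev_offset_s, next_offset_s):
    """Build a {next_label: prev_label} mapping from the shared overlap window.

    Segments are assumed to look like {"speaker": "A", "start": 1.2, "end": 3.4},
    with times relative to their own chunk (my assumption, not the documented shape).
    Offsets are each chunk's absolute start time in the full recording.
    """
    overlap = defaultdict(float)  # (prev_label, next_label) -> shared seconds
    for p in prev_segs:
        p0, p1 = prev_offset_s + p["start"], prev_offset_s + p["end"]
        for n in next_segs:
            n0, n1 = next_offset_s + n["start"], next_offset_s + n["end"]
            shared = min(p1, n1) - max(p0, n0)
            if shared > 0:
                overlap[(p["speaker"], n["speaker"])] += shared

    # Greedy assignment: pair labels by most shared speaking time first.
    mapping, used_prev, used_next = {}, set(), set()
    for (prev_l, next_l), _ in sorted(overlap.items(), key=lambda kv: -kv[1]):
        if prev_l not in used_prev and next_l not in used_next:
            mapping[next_l] = prev_l
            used_prev.add(prev_l)
            used_next.add(next_l)
    return mapping  # unmapped next-chunk labels would be treated as new speakers
```

So with 10-minute chunks and, say, 30s of overlap, chunk 2 starts at 570s and I’d call match_speaker_labels(chunk1_segs, chunk2_segs, 0.0, 570.0), then rewrite chunk 2’s labels through the mapping.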
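And for option 2, here’s how I imagine bootstrapping known_speaker_references from chunk 1’s timestamps. Big caveat: I’m passing the parameter via extra_body because I don’t know whether the SDK exposes it directly yet, and the value shape (file paths here, vs. base64 audio or something else) is pure guesswork on my part.

```python
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()
audio = AudioSegment.from_file("meeting.m4a")

# Diarized segments from my first chunk's response, in the same
# {"speaker", "start", "end"} shape I assumed above (values made up here).
first_chunk_segments = [
    {"speaker": "A", "start": 0.0, "end": 6.5},
    {"speaker": "B", "start": 6.5, "end": 14.0},
]

# Cut one reasonably long, clean clip per speaker to use as a reference.
ref_paths = []
for label in ("A", "B"):
    seg = next(s for s in first_chunk_segments
               if s["speaker"] == label and s["end"] - s["start"] >= 3)
    path = f"ref_{label}.mp3"
    audio[int(seg["start"] * 1000):int(seg["end"] * 1000)].export(path, format="mp3")
    ref_paths.append(path)

with open("chunk_1.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=f,
        response_format="diarized_json",  # same guess as above
        # Passing via extra_body since I'm not sure the SDK has a typed
        # parameter for this yet; the value shape is a guess too.
        extra_body={"known_speaker_references": ref_paths},
    )
```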
Thank you very much!