Best Practices for Maintaining Speaker Identity Across Chunks with gpt-4o-transcribe-diarize?

Hey everyone,

I’m trying out the new gpt-4o-transcribe-diarize model and running into the cross-chunk speaker identity problem with large audio files (with no known_speaker_references available at the start).

The Problem:

When processing audio files >25MB, I need to split them into chunks and send multiple requests (according to the documentation). Each chunk gets diarized independently with speaker labels like A, B, etc. However, these labels have no relationship across chunks:

  • Chunk 1 (0-10MB): A = Person A, B = Person B
  • Chunk 2 (10-25MB): A = Person B, B = Person A (labels flipped!)

My Questions:

  1. What’s the recommended approach for maintaining speaker continuity? Should I:
    • Use overlapping chunks and match speakers in the overlap (rough sketch after this list)?
    • Extract audio snippets based on the timestamps in chunk 1’s response and pass them as “known_speaker_references”?
    • Use a different chunking strategy altogether?
  2. Has anyone successfully implemented this? What’s your workflow?
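
For the overlap idea in question 1, this is roughly what I’m picturing (untested sketch; the segment fields speaker/start/end, in seconds relative to the full recording, are my guess at the response shape, not something taken from the docs):

```python
from collections import defaultdict

def match_speakers(chunk1_segments, chunk2_segments, overlap_start, overlap_end):
    """Map chunk-2 speaker labels to chunk-1 labels by how much speaking time
    they share inside the region covered by both chunks."""
    shared = defaultdict(float)  # (chunk2_label, chunk1_label) -> overlapping seconds
    for s2 in chunk2_segments:
        for s1 in chunk1_segments:
            start = max(s1["start"], s2["start"], overlap_start)
            end = min(s1["end"], s2["end"], overlap_end)
            if end > start:
                shared[(s2["speaker"], s1["speaker"])] += end - start

    mapping = {}
    # Greedily pair each chunk-2 label with the chunk-1 label it overlaps most.
    for (label2, label1), _ in sorted(shared.items(), key=lambda kv: -kv[1]):
        if label2 not in mapping and label1 not in mapping.values():
            mapping[label2] = label1
    return mapping  # e.g. {"A": "B", "B": "A"} for the flipped example above
```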

Thank you very much!


Hi Martin,

I am facing the same task: I have long recordings that I need to diarize. As you noted, known speaker references can be provided to the model (via the extra_body parameter of the API call). The requirement is that each reference clip be between 2 and 10 seconds long, in any input format supported by the main audio upload. After the first chunk has been processed, it should be possible to:

  • identify each speaker’s segments from the timestamps returned for that chunk, which give the time windows where a voice snapshot can be taken, and
  • carve one of those snippets out of the recording and provide it as a known speaker reference to the next model call, so it can match the speakers consistently.

I am about to implement that and can share the result if you are interested.
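
Roughly, this is the flow I’m picturing (untested sketch, not a working implementation: the diarized_json response format, the .segments field, and the exact known_speaker_names / known_speaker_references payload shape are my assumptions to verify against the docs; extra_body is the openai Python SDK’s standard way to pass extra request fields, and pick_reference_windows / clip_to_data_url are helpers I made up for this sketch):

```python
import base64

from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def pick_reference_windows(segments):
    """Pick one 2-10 s window per speaker label.

    The segment fields (speaker/start/end, in seconds) are my assumption
    about what the diarized response contains -- adjust to the real shape.
    """
    windows = {}
    for seg in segments:
        speaker, start_s, end_s = seg["speaker"], seg["start"], seg["end"]
        if speaker not in windows and end_s - start_s >= 2.0:
            windows[speaker] = (start_s, min(end_s, start_s + 10.0))
    return windows

def clip_to_data_url(recording, start_s, end_s):
    """Carve a snippet out of the recording and base64-encode it as WAV."""
    snippet = recording[int(start_s * 1000):int(end_s * 1000)]  # pydub slices in ms
    buf = snippet.export(format="wav")  # in-memory export, returns a file-like object
    return "data:audio/wav;base64," + base64.b64encode(buf.read()).decode()

recording = AudioSegment.from_file("meeting.mp3")  # the full, unsplit recording

# 1) Diarize the first chunk without any references.
with open("chunk_01.mp3", "rb") as f:
    first = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=f,
        response_format="diarized_json",  # assumption: name of the diarized response format
    )

# 2) Carve one reference snippet per speaker from the returned timestamps.
#    Chunk 1 starts at 0 s of the full recording, so its chunk-relative
#    timestamps can be used directly; add the chunk's offset otherwise.
names, references = [], []
for speaker, (start_s, end_s) in pick_reference_windows(first.segments).items():  # .segments is my guess at the response attribute
    names.append(speaker)
    references.append(clip_to_data_url(recording, start_s, end_s))

# 3) Diarize the next chunk, passing the snippets so the labels stay consistent.
with open("chunk_02.mp3", "rb") as f:
    second = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=f,
        response_format="diarized_json",
        extra_body={
            "known_speaker_names": names,            # assumption: exact field names and
            "known_speaker_references": references,  # payload format need to be verified
        },
    )
```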
