Hey everyone,
I’m trying the new gpt-4o-transcribe-diarize model and running into the cross-chunk speaker identity problem with large audio files (with no known_speaker_references available at the start).
The Problem:
When processing audio files larger than 25 MB, I need to split them into multiple requests (according to the documentation). Each chunk gets diarized independently with speaker labels like A, B, etc. However, these labels have no relationship across chunks:
- Chunk 1 (0-10 MB): A = Person A, B = Person B
- Chunk 2 (10-25 MB): A = Person B, B = Person A (labels flipped!)
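For reference, the chunking itself is nothing fancy; it looks roughly like the sketch below (I split by duration with pydub so each exported piece stays under the 25 MB upload limit; the chunk length and export format are just examples and depend on your bitrate):

```python
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()
CHUNK_MS = 10 * 60 * 1000  # example: 10-minute pieces; adjust so each export stays under 25 MB

def transcribe_in_chunks(path):
    audio = AudioSegment.from_file(path)
    results = []
    for offset_ms in range(0, len(audio), CHUNK_MS):
        piece = audio[offset_ms:offset_ms + CHUNK_MS]
        piece.export("chunk.mp3", format="mp3")
        with open("chunk.mp3", "rb") as f:
            # Each request is diarized independently, so the returned speaker
            # labels (A, B, ...) only have meaning inside this one chunk.
            results.append(
                client.audio.transcriptions.create(
                    model="gpt-4o-transcribe-diarize",
                    file=f,
                )
            )
    return results
```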
My Questions:
- What’s the recommended approach for maintaining speaker continuity? Should I:
- Use overlapping chunks and match speakers in the overlap (roughly the sketch after this list)?
- Extract audio based on the timestamps from chunk 1’s response, and add them as “known_speaker_references”?
- Use a different chunking strategy altogether?
- Has anyone successfully implemented this? What’s your workflow?
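To make option 1 concrete, here is roughly what I imagine the overlap matching would look like (a minimal sketch; the segment dicts with speaker/start/end fields are just my assumption about what the diarized response returns, with times converted to absolute seconds in the full recording):

```python
from collections import Counter

def match_speakers(prev_segments, next_segments, overlap_start, overlap_end):
    """
    Map the next chunk's speaker labels onto the previous chunk's labels
    by voting over segments that fall inside the shared overlap window.

    Segments are assumed to look like {"speaker": "A", "start": 12.3, "end": 15.0},
    with times in seconds relative to the full recording.
    """
    votes = Counter()
    for p in prev_segments:
        for n in next_segments:
            # How much do the two segments coincide inside the overlap window?
            lo = max(p["start"], n["start"], overlap_start)
            hi = min(p["end"], n["end"], overlap_end)
            if hi > lo:
                votes[(n["speaker"], p["speaker"])] += hi - lo

    # Greedily assign each next-chunk label to the previous-chunk label
    # it overlapped with the most.
    mapping, used = {}, set()
    for (next_label, prev_label), _ in votes.most_common():
        if next_label not in mapping and prev_label not in used:
            mapping[next_label] = prev_label
            used.add(prev_label)
    return mapping

# Example: if chunk 2's "A" mostly coincides with chunk 1's "B" during the
# shared 30 seconds, the mapping comes back as {"A": "B", "B": "A"}.
```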
Thank you very much!
Hi Martin,
I am facing the same task: I have long recordings that I need to diarize. As you noted, the call accepts an extra_body parameter that can pass known speakers to the model. The requirement is that each reference be between 2 and 10 seconds long, in any input format supported by the main audio upload. After the first chunk has been processed, it should be possible to:
- identify each speaker’s segments from the per-segment timestamps returned for the chunk, which tell you the time windows where a clean voice snapshot can be taken, and
- carve one of those windows out of the original recording and pass it as a labeled reference to the next model call, so the model can match the speaker consistently.
I am about to implement that, roughly along the lines of the sketch below, and can share the results if you are interested.
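This is untested; the extra_body payload shape for known_speaker_references (a name plus base64-encoded audio) is my assumption based on the docs and this thread, and I’m assuming the first response exposes per-segment speaker/start/end timestamps as discussed above:

```python
import base64
from openai import OpenAI
from pydub import AudioSegment  # used to slice the original recording

client = OpenAI()

def carve_reference(recording_path, start_s, end_s, out_path="ref.wav"):
    """Cut a 2-10 second snippet of a single speaker out of the full recording."""
    audio = AudioSegment.from_file(recording_path)
    snippet = audio[int(start_s * 1000):int(end_s * 1000)]  # pydub indexes in ms
    snippet.export(out_path, format="wav")
    return out_path

def transcribe_chunk(chunk_path, known_speakers=None):
    """
    Transcribe/diarize one chunk. `known_speakers` maps a display name to the
    path of a short reference clip. NOTE: the exact structure expected inside
    extra_body for known_speaker_references is an assumption here.
    """
    extra = None
    if known_speakers:
        extra = {
            "known_speaker_references": [
                {
                    "name": name,
                    "audio": base64.b64encode(open(path, "rb").read()).decode(),
                }
                for name, path in known_speakers.items()
            ]
        }
    with open(chunk_path, "rb") as f:
        return client.audio.transcriptions.create(
            model="gpt-4o-transcribe-diarize",
            file=f,
            extra_body=extra,
        )

# Workflow: diarize chunk 1 without references, pick one clean 2-10 s segment
# per speaker from its timestamps, carve those snippets out of the original
# file, then pass them as known speaker references when sending chunk 2.
```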
It looks like gpt-4o-transcribe-diarize adds context-based speaker name resolution on top of a Whisper result for the transcription and pyannote-style speaker embedding clustering.
For long videos or low-latency live streaming where chunking is applied, speaker diarization only works intra-chunk. For inter-chunk speaker tracking, I eventually had to fine-tune my own model and run it at the edge on a desktop.
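If you don’t want to go as far as fine-tuning, the inter-chunk tracking idea can be prototyped with an off-the-shelf speaker embedding model such as pyannote/embedding. The sketch below (not my fine-tuned setup; the model name, token placeholder, and 0.5 distance threshold are just examples) matches a new speaker clip against known speakers by cosine distance:

```python
import numpy as np
from pyannote.audio import Inference, Model
from scipy.spatial.distance import cdist

# Off-the-shelf speaker embedding model (requires a Hugging Face access token).
model = Model.from_pretrained("pyannote/embedding", use_auth_token="hf_...")
inference = Inference(model, window="whole")  # one embedding per clip

def embed(clip_path):
    """Return a (1 x D) speaker embedding for a short audio clip."""
    return np.atleast_2d(np.asarray(inference(clip_path)))

def match_to_known(clip_path, known, threshold=0.5):
    """
    Compare a new clip against known speakers by cosine distance.
    `known` maps a speaker name to an embedding from embed(); returns the
    closest name, or None if nothing is within the (example) threshold.
    """
    emb = embed(clip_path)
    best_name, best_dist = None, threshold
    for name, ref in known.items():
        dist = cdist(emb, ref, metric="cosine")[0, 0]
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```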
Hello Blade!
Yes, that’s exactly how I imagined it might work, but I haven’t been able to find any documentation describing such a workflow.
I’d be very interested to see your results!