How to sync dual-channel transcripts via OpenAI Whisper (VAD silence stripping destroys absolute timestamps)

I am building an automated call transcription pipeline for a PBX system. The goal is to generate a perfectly chronological, multi-speaker transcript (Caller vs. Callee) from standard 8kHz telephony audio.
My Attempted Solution
Because the OpenAI API downmixes stereo files to mono (which destroys speaker separation and causes heavy hallucination on 8kHz audio), I built a split-channel architecture:

  1. Asterisk: I use MixMonitor with the b,r(),t() flags to record the call legs into two separate, mathematically synchronized files (_caller.wav and _callee.wav).

  2. PHP Worker: A background script converts the files and fires two separate cURL requests to the Whisper API, requesting verbose_json to get exact timestamps.

  3. The Merge: The PHP script parses both JSON arrays, tags the speakers, merges the arrays, and sorts them chronologically by their start times to reconstruct the conversation.
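For reference, the merge step described above can be sketched roughly like this. It is written in Python rather than PHP purely for illustration, and the helper names (tag_segments, merge_by_start, format_line) are hypothetical. It assumes each verbose_json response carries a "segments" list whose entries have "start", "end" and "text" fields.

```python
from typing import Dict, List


def tag_segments(response: Dict, speaker: str) -> List[Dict]:
    """Copy the segments of one verbose_json response and label them with a speaker."""
    return [
        {
            "speaker": speaker,
            "start": float(seg["start"]),
            "end": float(seg["end"]),
            "text": seg["text"].strip(),
        }
        for seg in response.get("segments", [])
    ]


def merge_by_start(caller: Dict, callee: Dict) -> List[Dict]:
    """Combine both channels and order the segments by their start time."""
    merged = tag_segments(caller, "Caller") + tag_segments(callee, "Callee")
    # list.sort is stable, so on identical start times the Caller segment stays first.
    merged.sort(key=lambda seg: seg["start"])
    return merged


def format_line(seg: Dict) -> str:
    """Render one segment as '[MM:SS] Speaker: text'."""
    minutes, seconds = divmod(int(seg["start"]), 60)
    return f"[{minutes:02d}:{seconds:02d}] {seg['speaker']}: {seg['text']}"
```

Note that this orders whole segments only, so one long segment on a channel is placed entirely by its start time, even if the other party spoke in the middle of it.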

The Specific Issue I am Facing
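The merged transcript comes out jumbled: sentences from one speaker are split and interleaved with the other speaker's lines.
This is the transcription I am getting: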
[00:00] Caller: Hello, this is a Policy Test, my name is John Miller, today is Wednesday, May 27th, the

[00:00] Callee: Hi, if you record your name and reason for calling, I’ll see if this person is available.

[00:15] Caller: reference number is 473169, can you hear me clearly?

[00:24] Callee: Yes, I can hear you clearly.

[00:26] Callee: This is the Kohli site test.

[00:28] Callee: My name is Sarah Johnson.

[00:30] Callee: The audio quality sounds good from my side.

[00:33] Callee: Please continue with the verification.
[00:35] Caller: I will now test timestamps and speaker changes, the amount is $125, the meeting is scheduled

[00:43] Caller: for 10.30am, please confirm the details.

[00:48] Callee: Confirmed.

[00:49] Callee: $125.

[00:51] Callee: Meeting at 10.30 AM.

[00:53] Callee: I am also testing punctuation, pauses, and pronunciation.

[00:58] Caller: Now testing short interruptions, can you just say the color blue while I continue speaking?

[01:05] Callee: Blue.

[01:07] Caller: Thank you, now testing phone numbers 9876543210, final verification test, this call recording

[01:17] Callee: Received.

[01:18] Callee: Now testing email pronunciation.

[01:20] Callee: john.miller at example dot com

[01:26] Caller: should contain timestamps, speaker labels and accurate English transcriptions, ending

[01:32] Caller: test now.
The actual script of the test call I made:
Caller

Hello, this is the caller side test.

My name is John Miller.

Today is Wednesday, May twenty seventh.

The reference number is four seven three one six nine.

Can you hear me clearly?

Callee

Yes, I can hear you clearly.

This is the callee side test.

My name is Sarah Johnson.

The audio quality sounds good from my side.

Please continue with the verification.

Caller

I will now test timestamps and speaker changes.

The amount is one hundred twenty five dollars.

The meeting is scheduled for ten thirty AM.
Callee

Confirmed.

One hundred twenty five dollars.

Meeting at ten thirty AM.

I am also testing punctuation, pauses, and pronunciation.

Caller

Now testing short interruptions.

Can you say the color blue while I continue speaking?

Callee (interrupt slightly)

Blue.
Caller

Thank you.

Now testing phone numbers.

Nine eight seven six five four three two one zero.

Callee

Received.

Now testing email pronunciation.

john dot miller at example dot com.

Caller

Final verification test.

This call recording should contain timestamps,

speaker labels,
and accurate English transcription.

Ending test now.

AGI and Asterisk experts, please help if there is any possible solution to this problem from the AGI side.

Well honestly your split-channel approach is actually pretty solid already :sob: The timestamps/speaker reconstruction logic looks mostly correct to me.

The bigger issue feels more like Whisper struggling with :wink:

  • narrowband 8kHz telephony audio

  • overlapping speech/interruption handling

  • and context reconstruction across independently transcribed channels.

A few things stand out from your output tho :face_with_monocle:

  • “Policy Test” instead of “caller side test”

  • numbers normalized weirdly

  • sentence continuation split oddly across timestamps

  • interruption timing drift around the “blue” overlap

So that usually looks more like ASR inference limitations than AGI/Asterisk sync problems.

One thing I’d seriously test :thinking:

  • upsample audio to 16kHz before transcription (sox or ffmpeg)

  • even though no new information is created, Whisper tends to behave noticeably better on resampled telephony audio.
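A minimal sketch of the upsampling step, assuming ffmpeg is installed and on the PATH. The function names and the example file names are hypothetical; -ar sets the output sample rate and -ac 1 keeps each leg mono. Resampling does not shift any timestamps, so the two legs stay aligned.

```python
import subprocess
from typing import List


def build_upsample_command(src: str, dst: str, rate: int = 16000) -> List[str]:
    """Build the ffmpeg argument list that resamples one call leg to `rate` Hz mono."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), "-ac", "1", dst]


def upsample(src: str, dst: str, rate: int = 16000) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_upsample_command(src, dst, rate), check=True)
```

Usage would look like upsample("call_caller.wav", "call_caller_16k.wav"), repeated for the callee leg before sending both files to the API.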

Also maybe try :thinking:

  • adding small silence padding at start of both legs before transcription

  • forcing shorter segments/VAD chunking

  • aligning merged segments by midpoint timestamps instead of raw start times only.
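To make the last two ideas concrete, here is a rough sketch, not a tested fix for your pipeline. It assumes you can get per-word timestamps for each leg as dictionaries with "word", "start" and "end" keys (whether your Whisper request returns them is something to verify). It cuts each channel into shorter segments wherever the speaker pauses, then orders everything by segment midpoint. The function names and the 0.6 second pause threshold are made up for illustration.

```python
from typing import Dict, List


def _close_segment(words: List[Dict], speaker: str) -> Dict:
    """Collapse a run of consecutive words into one speaker-tagged segment."""
    return {
        "speaker": speaker,
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"].strip() for w in words),
    }


def split_on_pauses(words: List[Dict], speaker: str, max_gap: float = 0.6) -> List[Dict]:
    """Start a new segment whenever the silence between two words exceeds max_gap seconds."""
    segments: List[Dict] = []
    current: List[Dict] = []
    for word in words:
        if current and word["start"] - current[-1]["end"] > max_gap:
            segments.append(_close_segment(current, speaker))
            current = []
        current.append(word)
    if current:
        segments.append(_close_segment(current, speaker))
    return segments


def midpoint(segment: Dict) -> float:
    """Middle of the segment on the shared call timeline."""
    return (segment["start"] + segment["end"]) / 2.0


def merge_by_midpoint(*channels: List[Dict]) -> List[Dict]:
    """Interleave segments from all channels, ordered by midpoint instead of start time."""
    merged = [segment for channel in channels for segment in channel]
    merged.sort(key=midpoint)
    return merged
```

The point of splitting first is that a short reply like "Blue." can then land between two pieces of the other speaker's long sentence instead of after the whole thing.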

Hmm.. Your actual synchronization pipeline honestly seems cleaner than most PBX transcription setups I’ve ever seen :sob::broken_heart:

I am actually kind of stuck and don't know exactly how to resolve this issue within the above constraints. Thanks for the suggestions, though.