Timing offsets for long audio in WhisperX

I am using WhisperX and running into a problem: sometimes the transcript timestamps are offset from the start of the audio, and sometimes lines are skipped entirely, which really messes with postprocessing. Has anyone encountered this? How can I fix it?
