How to split Transcribe() lines into shorter segments?

Is there a way to make the Transcribe() function produce smaller segments instead of one chunk? Example:

[00:00.000 --> 00:04.320] Meeting you was like listening to a song for the first time, and knowing it would be my favorite.
Instead of this, I want something like:

[00:00.000 --> 00:01.320]  Meeting you was like
[00:01.320 --> 00:02.270]  listening to a song
[00:02.270 --> 00:03.150]  for the first time,
[00:03.150 --> 00:04.320] and knowing it would be my favorite.

(Note that this was just an example and the actual transcription probably doesn’t look like this)
I saw another thread that asked the same question, but the solution provided there didn’t work for me, so I’m asking again. Thank you.

Whisper returns its output in segments.

Looking at an open-source timestamp library, we can see it doesn’t offer much room for tweaking beyond a few model detection parameters, even if you run it locally and under your own control.

How about a best guess at those timestamps, splitting the text with code? I rewrote your specifications for a mechanical brain, then spent 25 minutes making bot corrections on gpt-4-today to get plausible output.

analysis

Let’s walk through the program with the new input string:

text = "[00:00.000 --> 00:04.320] The example text provided is split into chunks with the specified maximum word limit, and interpolated times are calculated based on the assumption of 150 words per minute."

The split_transcript function is called with text as the argument.
The function uses a regular expression to extract the start and end times from the text, which are 0.0 and 4.32 respectively.
The function also uses a regular expression to extract all the words from the text. There are 24 words in the text.
The total time for the audio is calculated as the difference between the end time and the start time, which is 4.32 seconds. The time per word is then calculated as the total time divided by the number of words, which is approximately 0.18 seconds per word.
The words are divided into chunks of up to 7 words each. There are 4 chunks in total.
The start and end times for each chunk are calculated based on the number of words in the chunk and the time per word. The times are stored in a list of tuples.
The output string is constructed by joining the words in each chunk and formatting the start and end times.
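
The steps above can be sketched roughly as follows. This is a minimal reconstruction from the description, not the program the bot actually produced: the name split_transcript comes from the walkthrough, while LINE_RE, format_time and the max_words parameter are my own placeholders, and the input is assumed to be a single line shaped like "[MM:SS.mmm --> MM:SS.mmm] text". Times are spread evenly per word across the segment, as in the steps above, so they are estimates and not real word timings. Note that splitting on whitespace gives 28 words for the example sentence, not 24.

```python
import re

# Assumed input shape: "[MM:SS.mmm --> MM:SS.mmm] text".
# The "→" alternative covers arrows that a forum renderer has converted.
LINE_RE = re.compile(
    r"\[(\d+):(\d+(?:\.\d+)?)\s*(?:-->|→)\s*(\d+):(\d+(?:\.\d+)?)\]\s*(.*)",
    re.DOTALL,
)


def format_time(seconds):
    """Format a number of seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    return f"{minutes:02d}:{ms // 1000:02d}.{ms % 1000:03d}"


def split_transcript(text, max_words=7):
    """Split one timestamped segment into chunks of at most max_words words.

    Chunk boundaries are interpolated linearly by word count, so every
    word is assumed to take the same amount of time.
    """
    match = LINE_RE.match(text.strip())
    if match is None:
        raise ValueError("expected '[MM:SS.mmm --> MM:SS.mmm] text'")

    start = int(match.group(1)) * 60 + float(match.group(2))
    end = int(match.group(3)) * 60 + float(match.group(4))
    words = match.group(5).split()
    if not words:
        return []

    per_word = (end - start) / len(words)  # seconds per word

    lines = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        chunk_start = format_time(start + i * per_word)
        chunk_end = format_time(start + (i + len(chunk)) * per_word)
        joined = " ".join(chunk)
        lines.append(f"[{chunk_start} --> {chunk_end}] {joined}")
    return lines
```

Calling print("\n".join(split_transcript(text))) on the input string above gives four chunks of seven words each:

[00:00.000 --> 00:01.080] The example text provided is split into
[00:01.080 --> 00:02.160] chunks with the specified maximum word limit,
[00:02.160 --> 00:03.240] and interpolated times are calculated based on
[00:03.240 --> 00:04.320] the assumption of 150 words per minute.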

The output of the program would be (ED: bot got this wrong):

[0.000 --> 1.260] The example text provided is split
[1.260 --> 2.520] into chunks with the specified maximum
[2.520 --> 3.780] word limit, and interpolated times are
[3.780 --> 4.320] calculated based on the assumption of…

The logic of the program seems to be correct, and it should produce the expected output for any conforming input. The output is a string where each line represents a chunk of words from the input text, with the start and end times for each chunk interpolated based on the total length of the audio and the number of words.
