How to split Transcribe() lines into shorter segments?

lukazashovski · September 22, 2023, 7:02pm

Is there a way to make the Transcribe() function produce smaller segments instead of one chunk? Example:

[00:00.000 --> 00:04.320] Meeting you was like listening to a song for the first time, and knowing it would be my favorite.
Instead of this, I want something like:

[00:00.000 --> 00:01.320] Meeting you was like
[00:01.320 --> 00:02.270] listening to a song
[00:02.270 --> 00:03.150] for the first time,
[00:03.150 --> 00:04.320] and knowing it would be my favorite.

(Note that this was just an example and the actual transcription probably doesn’t look like this)
I saw another thread that asked the same question, but the solution provided there didn’t work for me, so I’m asking again. Thank you.

_j · September 22, 2023, 8:38pm

Whisper returns its output in segments.

Looking at an open-source timestamp library, we can see it doesn’t offer much room for tweaking beyond a few model detection parameters, even if you run it locally and under your own control.

How about a best guess at those timestamps, interpolated in code when splitting? I rewrote your specification for a mechanical brain, then spent 25 minutes correcting the bot on gpt-4-today to get plausible output.

gpt4_bot:

Here’s a Python script that should achieve what you’re asking for:

import re

def to_seconds(stamp):
    # Convert an "MM:SS.mmm" timestamp to seconds
    minutes, seconds = stamp.split(':')
    return int(minutes) * 60 + float(seconds)

def to_stamp(seconds):
    # Convert seconds back to an "MM:SS.mmm" timestamp
    ms = round(seconds * 1000)
    minutes, ms = divmod(ms, 60000)
    return f"{minutes:02d}:{ms // 1000:02d}.{ms % 1000:03d}"

def split_transcript(transcript, max_words=7):
    # Extract start time, end time and text from the transcript line
    match = re.match(r'\[(.*?) --> (.*?)\]\s*(.*)', transcript)
    start_time = to_seconds(match.group(1))
    end_time = to_seconds(match.group(2))

    # Split the text on whitespace so punctuation stays attached to its word
    # (and the digits in the timestamps are not counted as words)
    words = match.group(3).split()
    num_words = len(words)
    if num_words == 0:
        return ""

    # Calculate the total time and time per word
    total_time = end_time - start_time
    time_per_word = total_time / num_words

    # Split the words into chunks
    chunks = [words[i:i+max_words] for i in range(0, num_words, max_words)]

    # Construct the output; each chunk's times follow from the number of
    # words before it and in it, so the last chunk ends exactly at end_time
    output = ""
    words_done = 0
    for chunk in chunks:
        chunk_start = start_time + words_done * time_per_word
        words_done += len(chunk)
        chunk_end = start_time + words_done * time_per_word
        output += f"[{to_stamp(chunk_start)} --> {to_stamp(chunk_end)}] {' '.join(chunk)}\n"

    return output

transcript = "[00:00.000 --> 00:04.320] The example text provided is split into chunks with the specified maximum word limit, and interpolated times are calculated based on the assumption of 150 words per minute."
print(split_transcript(transcript))

This script first extracts the start and end times and splits the text into words. It calculates an average time per word from the total time and the number of words, then splits the words into chunks of up to 7 words each (sentence endings are not taken into account). The time per word is used to interpolate the start and end times for each chunk. Finally, it formats the output as desired.

analysis

Let’s walk through the program with the new input string:

text = "[00:00.000 --> 00:04.320] The example text provided is split into chunks with the specified maximum word limit, and interpolated times are calculated based on the assumption of 150 words per minute."

The split_transcript function is called with text as the argument.
The function uses a regular expression to extract the start and end times from the text, which are 0.0 and 4.32 respectively.
The function also extracts all the words from the text. There are 28 words in the text.
The total time for the audio is calculated as the difference between the end time and the start time, which is 4.32 seconds. The time per word is then calculated as the total time divided by the number of words, which is approximately 0.154 seconds per word.
The words are divided into chunks of up to 7 words each. There are 4 chunks in total.
The start and end times for each chunk are calculated based on the number of words in the chunk and the time per word. The times are stored in a list of tuples.
The output string is constructed by joining the words in each chunk and formatting the start and end times.
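
The figures in the steps above can be checked with a few lines of Python. This is only a sketch: it assumes words are whitespace-separated tokens and timestamps are in MM:SS.mmm form, and the helper name to_seconds is made up for illustration.

```python
# Sample line from the walkthrough
line = ("[00:00.000 --> 00:04.320] The example text provided is split into chunks "
        "with the specified maximum word limit, and interpolated times are "
        "calculated based on the assumption of 150 words per minute.")

def to_seconds(stamp):
    """Convert an 'MM:SS.mmm' stamp to seconds (hypothetical helper)."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)

# Separate the "[start --> end" header from the spoken text
header, text = line.split("] ", 1)
start, end = (to_seconds(s) for s in header.lstrip("[").split(" --> "))

words = text.split()
time_per_word = (end - start) / len(words)

max_words = 7
num_chunks = -(-len(words) // max_words)  # ceiling division

# Chunk boundaries, capped at the real word count so the last one is the end time
boundaries = [
    round(start + min(i * max_words, len(words)) * time_per_word, 3)
    for i in range(num_chunks + 1)
]

print(len(words))               # 28
print(round(time_per_word, 3))  # 0.154
print(num_chunks)               # 4
print(boundaries)               # [0.0, 1.08, 2.16, 3.24, 4.32]
```

With 28 words and 7 words per chunk, every chunk spans 1.08 seconds.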

The output of the program would be (ED: bot got this wrong):

[0.000 --> 1.260] The example text provided is split
[1.260 --> 2.520] into chunks with the specified maximum
[2.520 --> 3.780] word limit, and interpolated times are
[3.780 --> 4.320] calculated based on the assumption of…

The logic of the program seems to be correct, and it should produce the expected output for any conforming input. The output is a string where each line represents a chunk of words from the input text, with the start and end times for each chunk interpolated based on the total length of the audio and the number of words.
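
Interpolation is only a guess. If word-level timings are available, the words can be regrouped with their real times instead. As far as I know, the open-source whisper package returns per-word timings when transcribe() is called with word_timestamps=True, but treat that as an assumption for your setup. The sketch below assumes a list of dicts with "word", "start" and "end" keys; the function names and the sample timings are invented for illustration and are not real Whisper output.

```python
def regroup_words(words, max_words=4):
    """Group word-level timings into short segments.

    `words` is a list of dicts with "word", "start" and "end" keys
    (an assumed input shape). Returns (start, end, text) tuples.
    """
    segments = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in group)
        # A segment runs from its first word's start to its last word's end
        segments.append((group[0]["start"], group[-1]["end"], text))
    return segments

def format_stamp(seconds):
    """Format seconds as MM:SS.mmm."""
    ms = round(seconds * 1000)
    minutes, ms = divmod(ms, 60_000)
    return f"{minutes:02d}:{ms // 1000:02d}.{ms % 1000:03d}"

# Made-up timings for illustration only
sample = [
    {"word": " Meeting", "start": 0.00, "end": 0.40},
    {"word": " you", "start": 0.40, "end": 0.60},
    {"word": " was", "start": 0.60, "end": 0.85},
    {"word": " like", "start": 0.85, "end": 1.32},
    {"word": " listening", "start": 1.32, "end": 1.80},
    {"word": " to", "start": 1.80, "end": 1.95},
    {"word": " a", "start": 1.95, "end": 2.05},
    {"word": " song", "start": 2.05, "end": 2.27},
]

for start, end, text in regroup_words(sample):
    print(f"[{format_stamp(start)} --> {format_stamp(end)}] {text}")
# [00:00.000 --> 00:01.320] Meeting you was like
# [00:01.320 --> 00:02.270] listening to a song
```

Grouping by a fixed word count is the simplest rule; breaking on punctuation or on long pauses between words would fit the same loop.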
