How to get Whisper's API to add timestamps to the transcripts?

VonSander · November 14, 2023, 11:51am

Looking to timestamp every second of the transcript. Is this possible natively? Thanks in advance!

nikola1jankovic · November 22, 2023, 9:04am

At the moment, it is only possible to get timecodes within subtitle files (srt, vtt). If you want word alignment and timestamps, you would need to combine Whisper with some other alignment solutions - and as these models are built for each language separately, it complicates the integration a bit.

mnemic · December 14, 2023, 10:52pm

Is there a particular reason it’s not supported by the API? It’s built into the whisper model it seems. I’m getting it when running the model locally.

supershaneski · December 15, 2023, 2:21am

It is included in the API. Just set the response_format parameter to srt or vtt.

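import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-1",
  response_format: "srt",
});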

See the API reference page for more details.
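
For anyone who wants the timestamps as numbers rather than subtitle text, the SRT string that comes back can be parsed into cues. This is only a minimal sketch: `parse_srt` and `srt_time_to_seconds` are made-up names (not part of the OpenAI SDK), and it assumes well-formed SRT with a blank line between cues.

```python
import re

# Hypothetical helpers for illustration; not part of the OpenAI SDK.
TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")

def srt_time_to_seconds(timecode):
    """Convert an SRT timecode such as "00:01:02,500" to seconds (62.5)."""
    match = TIMECODE.fullmatch(timecode.strip())
    if match is None:
        raise ValueError(f"not an SRT timecode: {timecode!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000

def parse_srt(srt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or " --> " not in lines[1]:
            continue  # skip anything that does not look like a cue
        start, end = lines[1].split(" --> ")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append((srt_time_to_seconds(start), srt_time_to_seconds(end), text))
    return cues

# Sample SRT text standing in for the API response.
sample = (
    "1\n00:00:00,000 --> 00:00:02,500\nHello there.\n\n"
    "2\n00:00:02,500 --> 00:00:05,000\nSecond line.\n"
)
for start, end, text in parse_srt(sample):
    print(f"{start:.1f}-{end:.1f}: {text}")
```

Note that the timestamps are per subtitle segment, so this gives one start and end time per cue rather than one per second or per word.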

mnemic · January 15, 2024, 3:48pm

Thanks! This helped.

In case someone is looking for it, here’s the example code I ended up with, and another function to clean up the text into a [0:00:00] timestamp format.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Function to transcribe audio using OpenAI's transcription service
def transcribe_audio(client, file_path):
    with open(file_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", 
            file=audio_file, 
            response_format="srt"
        )
        # With response_format="srt", the response is the SRT text as a plain string
        return process_transcription(transcription)

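# Helper called by process_transcription below. It was not shown in the post,
# so this is an assumed minimal version.
def format_time(srt_time):
    # "00:01:02,500" -> "0:01:02" (drop milliseconds and the leading hour zero)
    hours, minutes, seconds = srt_time.strip().split(',')[0].split(':')
    return f"{int(hours)}:{minutes}:{seconds}"

# Function to process the raw transcription into the desired format
def process_transcription(transcription):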
    blocks = transcription.split('\n\n')
    processed_lines = []
    for block in blocks:
        lines = block.split('\n')
        if len(lines) >= 3:
            time_range = lines[1]
            text = ' '.join(lines[2:])  # keep every text line of a multi-line cue
            start_time = time_range.split(' --> ')[0]
            # Convert the time format from "00:00:00,000" to "0:00:00"
            formatted_start_time = format_time(start_time)
            processed_line = f"[{formatted_start_time}]{text}"
            processed_lines.append(processed_line)
    return '\n'.join(processed_lines)
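
The same idea can be stretched to cover response_format="vtt" as well. A rough sketch, assuming the WebVTT output differs from SRT only in the leading WEBVTT header, the "." millisecond separator, and optional cue numbers; `to_bracketed_lines` is a hypothetical name, and unlike the code above it puts a space after the bracket.

```python
def to_bracketed_lines(subtitle_text):
    """Turn SRT or WebVTT text into "[h:mm:ss] text" lines."""
    out = []
    blocks = subtitle_text.replace("\r\n", "\n").strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        # Find the timing line instead of assuming it is always lines[1];
        # WebVTT cues may omit the numeric identifier.
        idx = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if idx is None:
            continue  # header block ("WEBVTT") or stray text
        start = lines[idx].split("-->")[0].strip()
        # Drop milliseconds: "," in SRT, "." in WebVTT.
        hms = start.replace(",", ".").split(".")[0]
        parts = hms.split(":")
        if len(parts) == 2:  # WebVTT allows mm:ss.ttt without hours
            parts = ["0"] + parts
        hours, minutes, seconds = parts
        text = " ".join(l.strip() for l in lines[idx + 1:] if l.strip())
        out.append(f"[{int(hours)}:{minutes}:{seconds}] {text}")
    return "\n".join(out)

# Sample WebVTT text standing in for the API response.
vtt = (
    "WEBVTT\n\n"
    "00:00:00.000 --> 00:00:02.500\nHello there.\n\n"
    "00:01:02.500 --> 00:01:05.000\nSecond\nline.\n"
)
print(to_bracketed_lines(vtt))
```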

ledjon · January 29, 2024, 10:06pm

Thanks @mnemic and @supershaneski. This was a very useful discussion and code example for me as well.
