How to get Whisper's API to add timestamps to the transcripts?

Looking to timestamp every second of the transcript. Is this possible natively? Thanks in advance!

At the moment, timecodes are only available within subtitle files (srt, vtt). If you want word-level alignment and timestamps, you would need to combine Whisper with a separate alignment solution, and because those alignment models are built for each language separately, that complicates the integration a bit.

Is there a particular reason it’s not supported by the API? It seems to be built into the Whisper model. I’m getting it when running the model locally.
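
For context, here is a minimal sketch of how segment timestamps can be read when running the model locally. It assumes the open-source openai-whisper Python package, whose transcribe() result includes a "segments" list with "start", "end" and "text" fields. The names format_segments and transcribe_locally are hypothetical helpers for illustration, not part of the library.

```python
from datetime import timedelta


def format_segments(segments):
    """Render Whisper-style segments as "[H:MM:SS] text" lines.

    Each segment is assumed to be a dict with "start" (seconds, float)
    and "text" keys, as in the local whisper package's output.
    """
    lines = []
    for segment in segments:
        # Truncate to whole seconds so the stamp reads like 0:01:05
        start = timedelta(seconds=int(segment["start"]))
        lines.append(f"[{start}] {segment['text'].strip()}")
    return "\n".join(lines)


def transcribe_locally(path, model_name="base"):
    """Run the model locally; assumes `pip install openai-whisper`."""
    import whisper  # imported lazily so format_segments works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

Note that these timestamps are per segment, not per second, so a segment that spans several seconds still gets a single start time.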


It is included in the API. Just set the response_format parameter to srt or vtt.

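import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-1",
  response_format: "srt", // or "vtt"
});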

See the API reference page for more details.


Thanks! This helped.

In case someone is looking for it, here’s the example code I ended up with, and another function to clean up the text into a [0:00:00] timestamp format.

# Function to transcribe audio using OpenAI's transcription service
def transcribe_audio(client, file_path):
    with open(file_path, "rb") as audio_file:
        # With response_format="srt", the result comes back as plain SRT text
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="srt"
        )
    # Pass the transcription directly for processing
    return process_transcription(transcription)

# Function to process the raw SRT transcription into "[0:00:00]text" lines
def process_transcription(transcription):
    blocks = transcription.strip().split('\n\n')
    processed_lines = []
    for block in blocks:
        lines = block.split('\n')
        # An SRT block is: index, time range, then one or more lines of text
        if len(lines) >= 3:
            time_range = lines[1]
            # Join all text lines so multi-line subtitles are not truncated
            text = ' '.join(lines[2:])
            start_time = time_range.split(' --> ')[0]
            # Convert the time format from "00:00:00,000" to "0:00:00"
            # (format_time is a helper that was not included in this post)
            formatted_start_time = format_time(start_time)
            processed_line = f"[{formatted_start_time}]{text}"
            processed_lines.append(processed_line)
    return '\n'.join(processed_lines)
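
The code above calls format_time, which is not included in the post. Below is a minimal sketch of such a helper, assuming SRT timestamps of the form HH:MM:SS,mmm and that the milliseconds can simply be dropped. It is one possible version, not necessarily the one the author used.

```python
from datetime import timedelta


def format_time(srt_time: str) -> str:
    """Convert an SRT timestamp like "00:01:05,320" to "0:01:05"."""
    # Split off the milliseconds after the comma; they are discarded
    clock, _, _millis = srt_time.strip().partition(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    # str(timedelta) renders as H:MM:SS with no leading zero on the hour
    return str(timedelta(hours=hours, minutes=minutes, seconds=seconds))
```

With this in place, a cue starting at 00:01:05,320 is rendered with the prefix [0:01:05]. VTT timestamps use a period before the milliseconds, so they would need a small adjustment.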

Thanks @mnemic and @supershaneski. This was a very useful discussion and code example for me as well.
