How to use whisper to handle long video?

Hello everyone, I currently want to use Whisper for speech-to-text transcription of videos, but I’ve encountered a few issues.

  1. I am a Plus user, and I’ve used the paid API by splitting a video into one file per minute and batch-processing the pieces with the code below. However, the code uses model="whisper-1". How can I modify it to use the latest Whisper v3?

     from openai import OpenAI
     client = OpenAI()
     audio_file = open("/path/to/file/german.mp3", "rb")
     transcript = client.audio.translations.create(
         model="whisper-1",
         file=audio_file,
     )

  2. Why are some instances of Whisper shared for free on the internet, while I have to pay to use Whisper through the API?

  3. Does the current version of Whisper still have a limitation where each request can only process files up to 25 MB?

If my 40-minute speech file is over 500 MB, does that mean I have to split it and process it in batches?

I used to split the file into batches for processing, but the timestamps in the resulting batch SRT files no longer lined up when combined. How can I address this timestamp-integration issue?
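In other words, each batch’s SRT restarts from 00:00:00,000, so every timestamp needs the batch’s start offset added back before merging. Something like this sketch (stdlib only, with a hypothetical shift_timestamp helper) is what I mean:

```python
# Each batch's SRT restarts at zero, so every timestamp needs the batch's
# start offset added back. shift_timestamp is a hypothetical helper name.
def shift_timestamp(ts: str, offset_ms: int) -> str:
    """Shift an SRT timestamp like '00:01:02,500' forward by offset_ms."""
    hh, mm, rest = ts.split(":")
    ss, ms = rest.split(",")
    total = ((int(hh) * 60 + int(mm)) * 60 + int(ss)) * 1000 + int(ms) + offset_ms
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms2 = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms2:03}"

# The second 5-minute batch starts at 300 s, i.e. an offset of 300000 ms:
print(shift_timestamp("00:01:02,500", 300_000))  # → 00:06:02,500
```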

Thank you.

1 Like
  1. At present, you can only use whisper-1 with the transcription/translation API. However, Whisper v3 is available free to use as a Python module. Check the Whisper GitHub page for how to install it on your computer.

  2. They are probably using the Python module.

  3. For the API, the limit still seems to be 25 MB. I checked the Whisper Python module and cannot see any file-size limitation.
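As a trivial pre-flight check against that cap (stdlib only; treat the 25 MB constant as an assumption to verify against the API docs):

```python
import os

API_LIMIT_BYTES = 25 * 1024 * 1024  # assumed 25 MB cap for the audio API

def under_api_limit(path: str) -> bool:
    """True if the file is small enough to send to the API in one request."""
    return os.path.getsize(path) <= API_LIMIT_BYTES
```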

1 Like

Thank you very much

I have visited the repository: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

But I’m still not quite sure how to install it. Is it possible to install it using Python code? Are there any sample codes I can refer to?

Also, does using Whisper here mean there are no file size limitations?

Additionally, how do I find the names of different model types to call in the code, such as base, medium, large, and the latest V3?

Thank you.

It means you need to encode the audio in a voice format that doesn’t waste so much data.

Whisper is open source, meaning that it can be used or modified by anyone, and is light enough to run on a 4 GB GPU, or slowly on a CPU.

A person could, for example, use Google’s cloud Tensor-processor ASICs and transcribe 50x faster than OpenAI can (https://huggingface.co/spaces/sanchit-gandhi/whisper-jax), or build an API on top of it.

Other varieties you can run on your own good hardware can offer by-the-word timestamps or be oriented towards video transcriptions.

The OpenAI API runs whisper-v2-large, but it could be upgraded to v3 without you knowing, as the newly released model is the same size.

1 Like

To install Whisper v3:

pip install git+https://github.com/openai/whisper.git 

You also need ffmpeg installed on your system


# macos

brew install ffmpeg

# windows using chocolatey

choco install ffmpeg

# windows using scoop

scoop install ffmpeg

To use Whisper from the command line in your terminal:

whisper audio1.mp3 audio2.mp3 --model medium
1 Like

Thank you for your reply. So, are you saying that it’s possible to convert audio files into node.js format?

Thank you for your reply. So, are you saying that I can simply open the terminal and install it directly? I’m using Windows 11 with Python 3.10.6, but I don’t have a dedicated GPU, only a CPU.


PS C:\User\XXX> python --version

Then, can I directly install the program you provided?


pip install git+https://github.com/openai/whisper.git

For this line, I have some questions:


whisper audio1.mp3 audio2.mp3 --model medium

Can it process two audio files at once? And it seems to use "--model medium", not "large V3". Is that correct? Thank you.

There is no “node.js” format.

You know how MP3 can take a CD and make it 1/10th the size? That’s 25-year-old technology now. Opus, the codec I gave an example of, is optimized for voice, and by limiting the input to just telephone-call quality, where voice audio lives, you can even improve the transcription.
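A quick back-of-envelope check of what a voice codec buys you. At a typical 16 kb/s mono Opus setting (an illustrative figure, not a measurement), a 40-minute talk lands far under the 25 MB API cap:

```python
# Estimated file size for 40 minutes of mono speech at an Opus-style bitrate.
minutes = 40
bitrate_kbps = 16                                  # illustrative voice bitrate
size_mb = minutes * 60 * bitrate_kbps / 8 / 1000   # kbit -> kB -> MB
print(f"{size_mb:.1f} MB")  # → 4.8 MB, versus the 500 MB original
```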

Given the many questions you still have without a good foundation yet, starting with the OpenAI service would be a good first step, although it doesn’t timestamp or make subtitle files.

1 Like

Yes, you can install it from the terminal. I have a very old Mac and it can translate/transcribe audio files, but of course it takes a very long time, and I cannot do anywhere near real-time transcription. But it’s free, so I cannot complain, lol.

You may need to update your Python version, as noted on the GitHub page. Then you can install it directly. Refer to the GitHub page for the complete installation procedure, including what to do if you hit a snag.

Yes, you can process more than one file at once. The current package already ships Whisper v3, so just using "large" is enough. But as I mentioned previously, if your system is not that powerful, maybe start with "tiny" first.
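For your earlier question about model names: these are the names the open-source CLI/module accepts, smallest to largest (as listed in the project README — verify there for the authoritative table):

```python
# Model names the whisper CLI/module accepts, smallest to largest.
# The versioned large checkpoints can also be requested explicitly,
# e.g. on the command line:  whisper audio.mp3 --model large-v3
WHISPER_MODELS = [
    "tiny", "base", "small", "medium", "large",
    "large-v1", "large-v2", "large-v3",
]
for name in WHISPER_MODELS:
    print(name)
```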

If you want to use this inside node.js application, just use exec:

import { exec } from 'child_process'

const command = `whisper './${filepath}' --language ${language} --temperature ${temperature} --model ${model} --output_dir '${outputDir}'`

exec(command, (error, stdout, stderr) => {
    if (error) {
        console.log(error.message)
        return
    }

    if (stderr) {
        console.log(stderr)
        return
    }

    console.log(stdout)
})
1 Like

I have created a Python script that automates generating subtitles for long videos. I hope it can be useful to many of you!

What does it do?

  • Splits a Long Video: The script takes a long video file and splits it into 5-minute mp3 segments.
  • Generates Subtitles: Each mp3 segment is then processed using the OpenAI Whisper-1 API to generate accurate subtitles.
  • Merges Subtitles: Finally, the script combines all the subtitle files into a single .srt file that corresponds to the original video.

Just run python generate_subtitle.py -i YOUR_VIDEO.mp4 on the command line.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# title: generate_subtitle
# date: "2024-01-30"
# description: "Split a long video into small mp3 files, generate .srt files for each mp3, and then merge all content into a single srt file"
# author: Hsieh-Ting Lin, the Lizard 🦎

import argparse
import os
import tempfile
import time

from moviepy.editor import VideoFileClip
from openai import OpenAI


def split_video(video_path, temp_dir, duration=300):
    print("\033[92mStarting to split into mp3 files every 5 minutes!\033[0m")
    video = VideoFileClip(video_path)
    for i in range(0, int(video.duration), duration):
        chunk_filename = os.path.join(temp_dir, f"chunk_{i//duration}.mp3")
        clip = video.subclip(i, min(i + duration, int(video.duration)))
        clip.audio.write_audiofile(chunk_filename)


def generate_subtitles(temp_dir):
    # client = OpenAI(api_key="YOUR_API_KEY")
    client = OpenAI()
    # Sort chunks numerically; os.listdir order is arbitrary, and a lexical
    # sort would put chunk_10 before chunk_2, corrupting the offsets below.
    files = sorted(
        (os.path.join(temp_dir, f) for f in os.listdir(temp_dir) if f.endswith(".mp3")),
        key=lambda p: int(os.path.basename(p).rsplit("_", 1)[1].split(".")[0]),
    )
    all_subtitles = []
    print(
        "\033[92mConnecting to OpenAI Whisper-1 API to generate subtitle files\033[0m"
    )

    for index, file in enumerate(files):
        with open(file, "rb") as audio_file:
            print(f"\033[92mLoading {os.path.basename(file)}\033[0m")
            start_time = time.time()
            response = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="srt",
            )
            if isinstance(response, str):
                offset = index * 300 * 1000  # 300 seconds in milliseconds
                adjusted_subtitle = adjust_timing(response, offset)
                all_subtitles.append(adjusted_subtitle)
                end_time = time.time()
                # Print the time taken to transcribe this chunk
                total_time = end_time - start_time
                print(f"∟Transcribing this chunk took {total_time:.1f} seconds.")
            else:
                print(f"Failed to transcribe {file}")
                print("Response:", response)  # Print the API response

    return all_subtitles


def adjust_timing(subtitle, offset):
    new_subtitle = []
    for line in subtitle.split("\n"):
        if "-->" in line:
            start, end = line.split(" --> ")
            new_start = adjust_time(start, offset)
            new_end = adjust_time(end, offset)
            new_subtitle.append(f"{new_start} --> {new_end}")
        else:
            new_subtitle.append(line)
    return "\n".join(new_subtitle)


def adjust_time(time_str, offset):
    hours, minutes, seconds_milliseconds = time_str.split(":")
    seconds, milliseconds = seconds_milliseconds.split(",")

    total_milliseconds = (
        int(milliseconds)
        + (int(seconds) + (int(minutes) + int(hours) * 60) * 60) * 1000
    )
    total_milliseconds += offset

    new_hours = total_milliseconds // 3600000
    total_milliseconds %= 3600000
    new_minutes = total_milliseconds // 60000
    total_milliseconds %= 60000
    new_seconds = total_milliseconds // 1000
    new_milliseconds = total_milliseconds % 1000

    return f"{new_hours:02}:{new_minutes:02}:{new_seconds:02},{new_milliseconds:03}"


def merge_subtitles(subtitles):
    combined_subtitles = "\n".join(subtitles)
    return reindex_subtitles(combined_subtitles)


def reindex_subtitles(srt_content):
    lines = srt_content.split("\n")
    new_content = []
    index = 1

    for line in lines:
        if line.isdigit():
            new_content.append(str(index))
            index += 1
        else:
            new_content.append(line)

    return "\n".join(new_content)


def main(video_path):
    video_filename = os.path.splitext(os.path.basename(video_path))[0]
    start_time = time.time()
    with tempfile.TemporaryDirectory() as temp_dir:
        split_video(video_path, temp_dir)
        subtitles = generate_subtitles(temp_dir)
        final_subtitles = merge_subtitles(subtitles)
        srt_filename = f"{video_filename}.srt"
        with open(srt_filename, "w", encoding="utf-8") as file:
            file.write(final_subtitles)
    print(
        f"\033[92m✔ Merging subtitle files into {os.path.basename(srt_filename)} completed!\033[0m"
    )
    end_time = time.time()
    total_time = end_time - start_time
    print(f"∟The whole process took {total_time:.1f} seconds.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate subtitles for a video.")
    parser.add_argument("-i", "--input", help="Input video file", required=True)
    args = parser.parse_args()

    main(args.input)
3 Likes