How to use whisper to handle long video?

Hello everyone, I currently want to use Whisper for speech-to-text transcription of videos, but I’ve encountered a few issues.

  1. I am a Plus user, and I’ve used the paid API by splitting a video into one file per minute and batch-processing the pieces with the code below. However, the code uses model="whisper-1". How can I modify it to use the latest Whisper v3?

     from openai import OpenAI
     client = OpenAI()
     audio_file = open("/path/to/file/german.mp3", "rb")
     transcript = client.audio.translations.create(
         model="whisper-1",
         file=audio_file,
     )

  2. Why are some instances of Whisper shared for free on the internet, while I have to pay to use Whisper through the API?

  3. Does the current version of Whisper still have a limitation where each request can only process files up to 25 MB?

If my 40-minute speech file is over 500 MB, does that mean I have to split it and process it in batches?

I used to split the file into batches for processing, but the timestamps in the resulting batch SRT files no longer lined up when combined. How can I address this timestamp-integration issue?
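In other words, each batch’s SRT restarts from 00:00:00,000, so every timestamp needs the batch’s start offset added back before merging. Something like this sketch (stdlib only, with a hypothetical shift_timestamp helper) is what I mean:

```python
# Each batch's SRT restarts at zero, so every timestamp needs the batch's
# start offset added back. shift_timestamp is a hypothetical helper name.
def shift_timestamp(ts: str, offset_ms: int) -> str:
    """Shift an SRT timestamp like '00:01:02,500' forward by offset_ms."""
    hh, mm, rest = ts.split(":")
    ss, ms = rest.split(",")
    total = ((int(hh) * 60 + int(mm)) * 60 + int(ss)) * 1000 + int(ms) + offset_ms
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms2 = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms2:03}"

# The second 5-minute batch starts at 300 s, i.e. an offset of 300000 ms:
print(shift_timestamp("00:01:02,500", 300_000))  # → 00:06:02,500
```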

Thank you.

1 Like
  1. At present, you can only use whisper-1 with the transcription/translation API. However, Whisper v3 is available free to use as a Python module. Check the Whisper GitHub page for how to install it on your computer.

  2. They are probably using the Python module.

  3. For the API, the limit still seems to be 25 MB. I checked the Whisper Python module and cannot see any file-size limitation.
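As a trivial pre-flight check against that cap (stdlib only; treat the 25 MB constant as an assumption to verify against the API docs):

```python
import os

API_LIMIT_BYTES = 25 * 1024 * 1024  # assumed 25 MB cap for the audio API

def under_api_limit(path: str) -> bool:
    """True if the file is small enough to send to the API in one request."""
    return os.path.getsize(path) <= API_LIMIT_BYTES
```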

1 Like

Thank you very much

I have visited the repository: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

But I’m still not quite sure how to install it. Is it possible to install it using Python code? Are there any sample codes I can refer to?

Also, does using Whisper here mean there are no file size limitations?

Additionally, how do I find the names of different model types to call in the code, such as base, medium, large, and the latest V3?

Thank you.

It means you need to encode the audio in a voice format that doesn’t waste so much data.

Whisper is open source, meaning that it can be used or modified by anyone, and is light enough to run on a 4 GB GPU, or slowly on a CPU.

A person could, for example, use Google’s cloud Tensor-processor ASICs and transcribe 50x faster than OpenAI can (https://huggingface.co/spaces/sanchit-gandhi/whisper-jax), or build an API on top of it.

Other varieties you can run on your own good hardware can offer by-the-word timestamps or be oriented towards video transcriptions.

The OpenAI API runs whisper-v2-large, but it could be upgraded to v3 without you knowing, as the newly released model is the same size.

1 Like

To install Whisper v3:

pip install git+https://github.com/openai/whisper.git 

You also need ffmpeg installed on your system


# macos

brew install ffmpeg

# windows using chocolatey

choco install ffmpeg

# windows using scoop

scoop install ffmpeg

To use Whisper from the command line in your terminal:

whisper audio1.mp3 audio2.mp3 --model medium
1 Like

Thank you for your reply. So, are you saying that it’s possible to convert audio files into node.js format?

Thank you for your reply. So, are you saying that I can simply open the terminal and install it directly? I’m using Windows 11 with Python 3.10.6, but I don’t have a dedicated GPU, only a CPU.


PS C:\User\XXX> python --version

Then, can I directly install the program you provided?


pip install git+https://github.com/openai/whisper.git

For this line, I have some questions:


whisper audio1.mp3 audio2.mp3 --model medium

Can it process two audio files at once? And it seems to use "--model medium", not "large V3". Is that correct? Thank you.

There is no “node.js” format.

You know how MP3 can take a CD and make it 1/10th the size? That’s 25-year-old technology now. Opus, the codec I gave an example of, is optimized for voice, and by limiting the input to just telephone-call quality, where voice audio lives, you can even improve the transcription.
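A quick back-of-envelope check of what a voice codec buys you. At a typical 16 kb/s mono Opus setting (an illustrative figure, not a measurement), a 40-minute talk lands far under the 25 MB API cap:

```python
# Estimated file size for 40 minutes of mono speech at an Opus-style bitrate.
minutes = 40
bitrate_kbps = 16                                  # illustrative voice bitrate
size_mb = minutes * 60 * bitrate_kbps / 8 / 1000   # kbit -> kB -> MB
print(f"{size_mb:.1f} MB")  # → 4.8 MB, versus the 500 MB original
```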

Given the many questions you still have without a good foundation yet, starting with the OpenAI service would be a good first step, although it doesn’t timestamp or make subtitle files.

1 Like

Yes, you can install it from the terminal. I have a very old Mac and it can translate/transcribe audio files, but of course it takes a very long time, and I cannot do anywhere near real-time transcription. But it’s free, so I cannot complain, lol.

You may need to update your Python version, as noted on the GitHub page. Then you can install it directly. Refer to the GitHub page for the complete installation procedure, including what to do if you hit a snag.

Yes, you can process more than one file at once. The current package already ships Whisper v3, so just using "large" is enough. But as I mentioned previously, if your system is not that powerful, maybe start with "tiny" first.
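For your earlier question about model names: these are the names the open-source CLI/module accepts, smallest to largest (as listed in the project README — verify there for the authoritative table):

```python
# Model names the whisper CLI/module accepts, smallest to largest.
# The versioned large checkpoints can also be requested explicitly,
# e.g. on the command line:  whisper audio.mp3 --model large-v3
WHISPER_MODELS = [
    "tiny", "base", "small", "medium", "large",
    "large-v1", "large-v2", "large-v3",
]
for name in WHISPER_MODELS:
    print(name)
```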

If you want to use this inside node.js application, just use exec:

import { exec } from 'child_process'

const command = `whisper './${filepath}' --language ${language} --temperature ${temperature} --model ${model} --output_dir '${outputDir}'`

exec(command, (error, stdout, stderr) => {
    if (error) {
        console.log(error.message)
        return
    }

    if (stderr) {
        console.log(stderr)
        return
    }

    console.log(stdout)
})
1 Like

I have created a Python script that automates generating subtitles for long videos. I hope it can be useful to many of you!

What does it do?

  • Splits a Long Video: The script takes a long video file and splits it into 5-minute mp3 segments.
  • Generates Subtitles: Each mp3 segment is then processed using the OpenAI Whisper-1 API to generate accurate subtitles.
  • Merges Subtitles: Finally, the script combines all the subtitle files into a single .srt file that corresponds to the original video.

Just run python generate_subtitle.py -i YOUR_VIDEO.mp4 on the command line.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# title: generate_subtitle
# date: "2024-01-30"
# description: "Split a long video into small mp3 files, generate .srt files for each mp3, and then merge all content into a single srt file"
# author: Hsieh-Ting Lin, the Lizard 🦎

import argparse
import os
import tempfile
import time

from moviepy.editor import VideoFileClip
from openai import OpenAI


def split_video(video_path, temp_dir, duration=300):
    print("\033[92mStarting to split into mp3 files every 5 minutes!\033[0m")
    video = VideoFileClip(video_path)
    for i in range(0, int(video.duration), duration):
        chunk_filename = os.path.join(temp_dir, f"chunk_{i//duration}.mp3")
        clip = video.subclip(i, min(i + duration, int(video.duration)))
        clip.audio.write_audiofile(chunk_filename)


def generate_subtitles(temp_dir):
    # client = OpenAI(api_key="YOUR_API_KEY")
    client = OpenAI()
    # Sort chunks numerically; os.listdir order is arbitrary, and a lexical
    # sort would put chunk_10 before chunk_2, corrupting the offsets below.
    files = sorted(
        (os.path.join(temp_dir, f) for f in os.listdir(temp_dir) if f.endswith(".mp3")),
        key=lambda p: int(os.path.basename(p).rsplit("_", 1)[1].split(".")[0]),
    )
    all_subtitles = []
    print(
        "\033[92mConnecting to OpenAI Whisper-1 API to generate subtitle files\033[0m"
    )

    for index, file in enumerate(files):
        with open(file, "rb") as audio_file:
            print(f"\033[92mLoading {os.path.basename(file)}\033[0m")
            start_time = time.time()
            response = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="srt",
            )
            if isinstance(response, str):
                offset = index * 300 * 1000  # 300 seconds in milliseconds
                adjusted_subtitle = adjust_timing(response, offset)
                all_subtitles.append(adjusted_subtitle)
                end_time = time.time()
                # Print the time taken to transcribe this chunk
                total_time = end_time - start_time
                print(f"∟Transcribing this chunk took {total_time:.1f} seconds.")
            else:
                print(f"Failed to transcribe {file}")
                print("Response:", response)  # Print the API response

    return all_subtitles


def adjust_timing(subtitle, offset):
    new_subtitle = []
    for line in subtitle.split("\n"):
        if "-->" in line:
            start, end = line.split(" --> ")
            new_start = adjust_time(start, offset)
            new_end = adjust_time(end, offset)
            new_subtitle.append(f"{new_start} --> {new_end}")
        else:
            new_subtitle.append(line)
    return "\n".join(new_subtitle)


def adjust_time(time_str, offset):
    hours, minutes, seconds_milliseconds = time_str.split(":")
    seconds, milliseconds = seconds_milliseconds.split(",")

    total_milliseconds = (
        int(milliseconds)
        + (int(seconds) + (int(minutes) + int(hours) * 60) * 60) * 1000
    )
    total_milliseconds += offset

    new_hours = total_milliseconds // 3600000
    total_milliseconds %= 3600000
    new_minutes = total_milliseconds // 60000
    total_milliseconds %= 60000
    new_seconds = total_milliseconds // 1000
    new_milliseconds = total_milliseconds % 1000

    return f"{new_hours:02}:{new_minutes:02}:{new_seconds:02},{new_milliseconds:03}"


def merge_subtitles(subtitles):
    combined_subtitles = "\n".join(subtitles)
    return reindex_subtitles(combined_subtitles)


def reindex_subtitles(srt_content):
    lines = srt_content.split("\n")
    new_content = []
    index = 1

    for line in lines:
        if line.isdigit():
            new_content.append(str(index))
            index += 1
        else:
            new_content.append(line)

    return "\n".join(new_content)


def main(video_path):
    video_filename = os.path.splitext(os.path.basename(video_path))[0]
    start_time = time.time()
    with tempfile.TemporaryDirectory() as temp_dir:
        split_video(video_path, temp_dir)
        subtitles = generate_subtitles(temp_dir)
        final_subtitles = merge_subtitles(subtitles)
        srt_filename = f"{video_filename}.srt"
        with open(srt_filename, "w", encoding="utf-8") as file:
            file.write(final_subtitles)
    print(
        f"\033[92m✔ Merging subtitle files into {os.path.basename(srt_filename)} completed!\033[0m"
    )
    end_time = time.time()
    total_time = end_time - start_time
    print(f"∟The whole process took {total_time:.1f} seconds.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate subtitles for a video.")
    parser.add_argument("-i", "--input", help="Input video file", required=True)
    args = parser.parse_args()

    main(args.input)
3 Likes