Project: Running your own Whisper-Large-v3 model and extracting Audio Embeddings

I thought I’d start this project thread on running your own copy of the OpenAI model ‘whisper-large-v3’. In addition, I want to show how to “hack” the model internals so you can also acquire an embedding vector of the audio file directly.

If you have any questions as I show how to do this, feel free to chime in.

7 Likes

For this thread, I am going “production first”, so I will be using various public cloud and GPU resources that anyone can use. Then, as a fallback (for the hobbyist), I’d like to do a local, CPU-only Docker version. This would be for, say, long YouTube videos or anything you are willing to crunch in the background.

As for the embedding vector, this is illustrative of how embeddings are made. In this case I am going to create it by mean-pooling the hidden states of the model into a 1280-dim vector, which I will also normalize, so it’s ready for cosine similarity, via simple dot products, right out of the gate.

With the embeddings, I am planning a simple demonstration of classifying with a kNN voting scheme, and I’ll also show how to use FAISS with Flat and HNSW indexes. Possibly even throw in vectorized NumPy to do Flat search without the FAISS library overhead as well.
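As a preview, here is a minimal sketch of that kNN voting idea (all names here are made up for illustration, and it assumes you have already stored L2-normalized 1280-dim embeddings):

import numpy as np

# Assumes: X is an (N, 1280) float32 matrix of already-normalized embeddings,
# labels is a length-N list of class names, query is one normalized 1280-dim vector.
def knn_vote(query, X, labels, k=5):
    sims = X @ query                          # cosine similarity == dot product for unit vectors
    top = np.argsort(-sims)[:k]               # indices of the k most similar items
    votes = [labels[i] for i in top]
    return max(set(votes), key=votes.count)   # simple majority vote

# The same Flat (exact) search via FAISS, using inner product on unit vectors:
# import faiss
# index = faiss.IndexFlatIP(1280)
# index.add(X)
# D, I = index.search(query[None, :], k)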

The reason for using this audio embedding vector, vs. embedding the transcribed text, will be discussed; hopefully it will become obvious why these are, and should be, different.

Prep to start off today.

For “production” (not local/hobbyist), get a Hugging Face account. We will be running serverless on an NVIDIA H200, using their ZeroGPU Spaces and their CI/CD pipelines to build and deploy as well. You will get your own private API key, and we will hack together to fight through the Python conflict drama to shape this how we want.

In the end, you will have an endpoint that responds just like the Whisper OpenAI endpoint, and you also get back the audio embedding vector, something not (currently) offered in the OpenAI API.

2 Likes

I’ll sneak in here. With this you can get away with a GPU 6 times slower, if you only need English: GitHub - huggingface/distil-whisper: Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

1 Like

OK, so today I will post the main code files, and describe how to get this working.

The quickest and easiest (and maybe cheapest) way to get this going on a GPU in the cloud, transcribing at ~15x real-time, is to pay HF the $9 a month for Pro. This allows you to clone the openai/whisper-large-v3 repo into your own ZeroGPU Space and get going “quickly” (caveats in a minute). I will have a free way to do this as well, for the hobbyist who doesn’t need the insane speedup that a hefty GPU can provide. But this paid method will yield your own private endpoint that will transcribe 15 to 55 hours of audio per day, quickly too, for a flat monthly fee.

So go to openai/whisper-large-v3 in :hugs:

Then click “Deploy” and in the drop down, select “Spaces”.

From here, clone the repo, and boom, you have a starting CI/CD pipeline.

It won’t work, unfortunately (this is the caveat). As an exercise you can try getting it to work, and you will find all sorts of Python conflicts, in particular between the NumPy and PyTorch versions, assuming you are shooting for the latest, vs. the old PyTorch version the container inherits from HF. :frowning:

Good news, I spent three hours doing this dance, and have magical files that will work :sweat_smile:

So in your CI/CD, edit your “app.py” to have these contents:

import spaces
import torch
from fastapi import FastAPI, UploadFile, File, Form
from pydantic import BaseModel
import tempfile
import os
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.pipelines.audio_utils import ffmpeg_read
import yt_dlp
from fastapi.responses import PlainTextResponse

app = FastAPI()

MODEL_NAME = "openai/whisper-large-v3"

# -------------------------------------------------------------
# ZeroGPU‑compatible lazy loader running on an H200 slice
# -------------------------------------------------------------

@spaces.GPU  # HF scheduler looks for this
def whisper_infer(waveform, task="transcribe"):
    """Return (text, unit_embedding) for a 16‑kHz mono waveform."""
    if not hasattr(whisper_infer, "model"):
        device = "cuda"
        processor = AutoProcessor.from_pretrained(MODEL_NAME)
        model = (
            AutoModelForSpeechSeq2Seq
            .from_pretrained(MODEL_NAME, output_hidden_states=True)
            .to(device)
            .eval()
        )
        whisper_infer.processor = processor
        whisper_infer.model = model

    processor = whisper_infer.processor
    model = whisper_infer.model
    device = next(model.parameters()).device

    inputs = processor(
        waveform.squeeze(), sampling_rate=16_000, return_tensors="pt"
    ).to(device)
    input_features = inputs["input_features"]  # (B, 128, T) – large-v3 uses 128 mel bins

    with torch.no_grad():
        # --- encoder pass for embedding ---
        enc_out = model.model.encoder(
            input_features=input_features, output_hidden_states=True
        )
        hidden = enc_out.hidden_states[-1]  # (B, T, D)
        emb = torch.nn.functional.normalize(
            hidden.mean(dim=1), p=2, dim=-1
        ).squeeze().cpu().tolist()

        # --- generate transcription (pass the requested task through) ---
        gen_ids = model.generate(**inputs, task=task)
        text = processor.batch_decode(gen_ids, skip_special_tokens=True)[0]

    return text, emb

# -------------------------------------------------------------
# Helper: decode any audio file to 16‑kHz mono tensor
# -------------------------------------------------------------

def decode_audio(path: str) -> torch.Tensor:
    try:
        wav, sr = torchaudio.load(path)
    except Exception:
        with open(path, "rb") as f:
            raw = f.read()
        arr = ffmpeg_read(raw, 16000)  # returns np.ndarray (T,)
        wav = torch.from_numpy(arr).unsqueeze(0)
        sr = 16000

    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    return wav

# -------------------------------------------------------------
# FastAPI routes
# -------------------------------------------------------------

@app.get("/", response_class=PlainTextResponse)
def root():
    return (
        "Whisper‑Large‑v3 inference Space (ZeroGPU).\n"
        "POST /transcribe  – multipart file 'audio', field 'task'\n"
        "POST /yt_transcribe – JSON {\"url\": …, \"task\": …}\n"
        "Swagger UI: /docs"
    )

@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...), task: str = Form("transcribe")):
    with tempfile.NamedTemporaryFile(delete=False, suffix=".audio") as tmp:
        tmp.write(await audio.read())
        tmp_path = tmp.name

    waveform = decode_audio(tmp_path)
    os.remove(tmp_path)

    text, embedding = whisper_infer(waveform, task)
    return {"text": text, "embedding": embedding}

class YTRequest(BaseModel):
    url: str
    task: str = "transcribe"

@app.post("/yt_transcribe")
def yt_transcribe(req: YTRequest):
    with tempfile.TemporaryDirectory() as tmpdir:
        fp = os.path.join(tmpdir, "yt_audio.m4a")
        yt_opts = {"format": "bestaudio/best", "outtmpl": fp, "quiet": True}
        yt_dlp.YoutubeDL(yt_opts).download([req.url])
        waveform = decode_audio(fp)

    text, embedding = whisper_infer(waveform, req.task)
    return {"text": text, "embedding": embedding}

# -------------------------------------------------------------
# Local dev entry‑point (ignored by HF runtime)
# -------------------------------------------------------------

if __name__ == "__main__":
    import uvicorn
    port = int(os.environ.get("PORT", 7860))
    uvicorn.run("app:app", host="0.0.0.0", port=port, workers=1)

Also, modify your “requirements.txt” to this:

# requirements.txt  (ZeroGPU)
numpy<2            # 1.26.4 wheel → ABI matches torch-2.2.*
torch==2.2.1       # leave explicit so pip doesn’t try 2.5+
torchaudio==2.2.1
transformers>=4.40.0
fastapi
uvicorn
yt-dlp

In this build, I stripped out the Gradio interface (didn’t need it, extra bloat, just running via API), but if you want it, add it back in.

Now you need a read API key for your account. A read/write works too. But at least read.

Here is a driver that shows an example of transcribing from a web-hosted mp3 (think S3 bucket, or wherever you host your stuff):

Note: Be sure to update with your API key and also your username, to get the correct path to your Space, and pick a real file to transcribe. The driver reads the key from the HF_TOKEN environment variable if it’s set, with a placeholder fallback.

import requests, json, os

# Read the token from the environment if it's set; otherwise fall back to a placeholder.
HF_TOKEN = os.environ.get("HF_TOKEN", "hf_YOUR_API_KEY")   # repo-read scope
AUDIO_URL = (
    "https://example.com/some.mp3"
)


# 1. download the MP3 into memory
mp3_bytes = requests.get(AUDIO_URL, timeout=30).content

# 2. hit the Space
resp = requests.post(
    "https://<your-username>-whisper-large-v3.hf.space/transcribe",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    data={"task": "transcribe"},                       # form field
    files={"audio": ("recording.mp3", mp3_bytes, "audio/mpeg")},  # multipart file
    timeout=300,
)

resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))

Embedding magic?

That is where the mean-pooling (averaging) of the hidden layers, at inference, comes in. See the lines that contain this:

        hidden = enc_out.hidden_states[-1]  # (B, T, D)
        emb = torch.nn.functional.normalize(
            hidden.mean(dim=1), p=2, dim=-1
        ).squeeze().cpu().tolist()

This takes the encoder’s final hidden states (one per time frame) and simply averages them over time. However, a simple average is not a unit vector, which is why we call the normalize function on this mean-pool. So now you have vectors where, when you do cosine similarity, you can ignore the normalization in the denominator and just focus on the MAC (multiply/accumulate) operations, which are more efficient.
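Here is a tiny sketch of why the up-front normalization pays off (purely illustrative, not part of the Space code):

import numpy as np

a = np.random.randn(1280).astype(np.float32)
b = np.random.randn(1280).astype(np.float32)

# "Classical" cosine similarity: divide by both norms on every single comparison.
cos_classic = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once up front (what the Space does before returning the vector)...
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# ...then similarity is just a dot product: multiply/accumulate only.
cos_fast = a_hat @ b_hat

assert np.isclose(cos_classic, cos_fast)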

So, that’s it for today. You should see the transcription, same one as the OpenAI API using Whisper, and you get a 1280 dim vector of floats back as well.

Note there are two endpoints here, /transcribe and /yt_transcribe. The “yt” version is for YouTube transcriptions, while the other one is for plain old audio files. I haven’t tried the YouTube transcription yet, so any feedback is welcome if it’s broken.
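For reference, based on the route in app.py, a call to it should look roughly like this (untested sketch; the video URL is a placeholder):

import requests, os

HF_TOKEN = os.environ.get("HF_TOKEN", "hf_YOUR_API_KEY")

resp = requests.post(
    "https://<your-username>-whisper-large-v3.hf.space/yt_transcribe",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={"url": "https://www.youtube.com/watch?v=VIDEO_ID", "task": "transcribe"},
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["text"][:200])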

Next time, I will talk about what to do with this vector.

5 Likes

Is it possible to retrieve the embeddings too when running whisper locally or from a direct import from pip install git+https://github.com/openai/whisper.git ?

1 Like

This is the closest I’ve got with o3’s help… :sweat_smile:

import whisper, torch

model = whisper.load_model("turbo")      # or "large-v3", "medium", …
file_name="audio.m4a"
audio = whisper.load_audio(file_name)
audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(audio,
                                  n_mels=model.dims.n_mels).to(model.device)

with torch.no_grad():
    frames = model.encoder(mel.unsqueeze(0))   # (1, n_frames, d_state)

embedding = frames.mean(dim=1)                 # (1, d_state)
print("embedding",embedding.shape)
embedding

Output:

embedding torch.Size([1, 1280])
tensor([[-0.2784, -0.0755, -0.0026,  ...,  0.0212,  0.0308,  0.0101]],
       device='cuda:0')

Is this the expected vector?

1 Like

The very first number in your list looks too big in magnitude … here are numbers I get.

“embedding”: [
-0.02574097365140915,
0.01992025040090084,
0.02521919459104538,
-0.0085120415315032,
-0.007345042657107115,
0.016543207690119743,
-0.012324860319495201,
-0.013023446314036846,
0.013012935407459736,
-0.01950785145163536,
0.018171969801187515,
0.015129524283111095,
0.001982167363166809,

Just from a high-level code inspection, it doesn’t look like you are normalizing the vector. This is OK if you want to do “classical” cosine similarity and divide out by the norms of the vectors dynamically … but that’s a bunch of wasted computation if you ask me, since you can just do it once here, before you throw it in your DB.
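If you want the same ready-for-dot-product property, one extra line on top of your snippet does it (a sketch, assuming embedding is the (1, 1280) tensor you printed):

import torch.nn.functional as F

# L2-normalize so that cosine similarity later is just a dot product.
embedding = F.normalize(embedding, p=2, dim=-1)   # still (1, 1280), now unit length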

I will do local versions here too, maybe next for those wanting to skip HF? How about:

  1. Local PyTorch
  2. Local Docker

The Docker is then the gateway to other cloud providers, so it’s good to build this.

3 Likes

I see. Well, I will await the follow ups, perhaps I may catch the idea as we advance more.

That’s a good idea. I find HF interesting but it is nice to have free options too.

Thanks for the initiative of sharing this project with us!

2 Likes

Here is a local PyTorch build/setup for Mac (CPU only)

Set up a venv.

python3 -m venv whisper_cpu
source whisper_cpu/bin/activate
python -m pip install --upgrade pip

Get compatible PyTorch and NumPy versions

python -m pip install \
  --index-url https://download.pytorch.org/whl/cpu \
  torch==2.4.1 torchaudio==2.4.1 "numpy<2"

Check your config so far

python - <<'PY'
import torch, torchaudio, platform, sys
print("torch", torch.__version__, "| device ->", torch.device("cpu"))
print("torchaudio", torchaudio.__version__)
print("python", sys.version.split()[0], "|", platform.machine())
PY

Expected outputs

torch 2.4.1 | device -> cpu
torchaudio 2.4.1
python 3.12.8 | arm64

Install ffmpeg
brew install ffmpeg

Install additional things
python -m pip install "transformers>=4.40.0" tqdm soundfile

Create your whisper_cpu.py file with these contents (model pulled from :hugs:)

import sys, torch, torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.pipelines.audio_utils import ffmpeg_read

MODEL = "openai/whisper-large-v3"

processor = AutoProcessor.from_pretrained(MODEL)
model = (AutoModelForSpeechSeq2Seq
         .from_pretrained(MODEL, output_hidden_states=True)
         .eval())

def load(path):
    try:
        wav, sr = torchaudio.load(path)
    except Exception:
        with open(path, "rb") as f:
            wav = torch.from_numpy(ffmpeg_read(f.read(), 16_000)).unsqueeze(0)
        sr = 16_000
    if sr != 16_000:
        wav = torchaudio.functional.resample(wav, sr, 16_000)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    return wav

def transcribe(p):
    wav = load(p)
    inp = processor(wav.squeeze(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        enc = model.model.encoder(
            input_features=inp["input_features"], output_hidden_states=True
        )
        emb = torch.nn.functional.normalize(
            enc.hidden_states[-1].mean(1), p=2, dim=-1
        )[0].tolist()
        text = processor.batch_decode(model.generate(**inp), skip_special_tokens=True)[0]
    return text, emb

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: python whisper_cpu.py <audio-file>")
    txt, vec = transcribe(sys.argv[1])
    print("\n=== Transcript ===\n", txt)
    print("\nEmbedding dim:", len(vec), "  (first 8 floats)", vec[:8])

Run the code locally:
python whisper_cpu.py ~/Downloads/test_audio.mp3

Get results:
=== Transcript ===

  Hey, it's your AI, just a voice from the cloud checking in. Got a minute to talk? Say whatever's on your mind. I'm listening.

Embedding dim: 1280   (first 8 floats) [-0.026704978197813034, 0.014102393761277199, 0.024107782170176506, -0.013233265839517117, -0.0075990138575434685, 0.015612567774951458, -0.01202069129794836, -0.010540582239627838]

Note, the first time you run this, you download the model, which takes a while:

preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████| 340/340 [00:00<00:00, 305kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████| 283k/283k [00:00<00:00, 3.71MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 6.01MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████| 2.48M/2.48M [00:00<00:00, 7.30MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████| 494k/494k [00:00<00:00, 4.77MB/s]
normalizer.json: 100%|███████████████████████████████████████████████████████████████████████| 52.7k/52.7k [00:00<00:00, 18.3MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████| 34.6k/34.6k [00:00<00:00, 39.7MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████| 2.07k/2.07k [00:00<00:00, 14.6MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████| 1.27k/1.27k [00:00<00:00, 21.1MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████| 3.09G/3.09G [07:31<00:00, 6.84MB/s

After this, running will be quick, or as fast as your CPU. You will see some warnings, but you can ignore them or patch them.

UserWarning: `return_dict_in_generate` is NOT set to `True`, but `output_hidden_states` is. When `return_dict_in_generate` is not `True`, `output_hidden_states` is ignored.
  warnings.warn(
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
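For example, the language-detection warning can be silenced by taking its own advice and making the task/language explicit in whisper_cpu.py (a sketch; only do this if your audio really is English):

# Inside transcribe(), under torch.no_grad(): be explicit instead of auto-detecting.
gen_ids = model.generate(**inp, language="en", task="transcribe")
text = processor.batch_decode(gen_ids, skip_special_tokens=True)[0]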

You can port this over to GPUs if you have them, or to other architectures.
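For example, here is a minimal sketch of the device tweaks to whisper_cpu.py for NVIDIA CUDA or Apple-silicon MPS (untested beyond CPU on my side):

import torch

# Pick the best available device.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = model.to(device)   # right after from_pretrained(...).eval()

# ...and inside transcribe(), move the processed features to the same device:
# inp = processor(wav.squeeze(), sampling_rate=16_000, return_tensors="pt").to(device)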

4 Likes

I’m curious what the issues are with “compatible”..

openai-whisper running on Python 3.13:

CUDA is available. Attempting to load model on GPU (cuda:0)...
Whisper model loaded on device: cuda:0
transcript: This is a radio show where people call us…

>>> print(torch.__version__, transformers.__version__, whisper.__version__)
2.7.0+cu126 4.51.3 20240930

Crank all your versions up to max:

pip install --upgrade --upgrade-strategy eager pip setuptools wheel build
### check torch url:  https://pytorch.org/get-started/locally/
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install --upgrade --upgrade-strategy eager transformers datasets[audio] accelerate huggingface_hub

Then, for installing openai-whisper normally via pip while patching it for Python 3.13, I offer this script.

Run it with the same rights as you run pip; for example, for a system install on Windows, use an administrator command prompt:

python patch_whisper_pip_install.py

import urllib.request
import tarfile
import os
import subprocess
import tempfile
from pathlib import Path
import sys
import shutil
import warnings # For handling the DeprecationWarning if needed for older Pythons

# --- Configuration ---
PACKAGE_NAME = "openai-whisper"
PACKAGE_VERSION = "20240930"
PACKAGE_URL = f"https://files.pythonhosted.org/packages/f5/77/952ca71515f81919bd8a6a4a3f89a27b09e73880cebf90957eda8f2f8545/{PACKAGE_NAME}-{PACKAGE_VERSION}.tar.gz"
PACKAGE_FILENAME = f"{PACKAGE_NAME}-{PACKAGE_VERSION}.tar.gz"
EXTRACTED_DIR_NAME = f"{PACKAGE_NAME}-{PACKAGE_VERSION}" 
SETUP_PY_FILENAME = "setup.py"

OLD_FUNCTION_TEXT = """\
def read_version(fname="whisper/version.py"):
    exec(compile(open(fname, encoding="utf-8").read(), fname, "exec"))
    return locals()["__version__"]"""

NEW_FUNCTION_TEXT = """\
def read_version(fname="whisper/version.py"):
    local_vars = {}
    global_vars = {}  # unused
    with open(fname, encoding="utf-8") as f:
        exec(compile(f.read(), fname, "exec"), global_vars, local_vars)
    return local_vars["__version__"]"""

def uninstall_existing_package(package_name):
    """Attempts to uninstall an existing version of the package."""
    print(f"Attempting to uninstall any existing version of '{package_name}'...")
    pip_command = [sys.executable, "-m", "pip", "uninstall", "-y", package_name]
    try:
        process = subprocess.run(pip_command, check=False, capture_output=True, text=True) # check=False as it's okay if not installed
        if process.returncode == 0:
            print(f"Successfully uninstalled '{package_name}'.")
            if process.stdout: print("Uninstall stdout:\n", process.stdout)
        elif "not installed" in process.stdout.lower() or "not installed" in process.stderr.lower():
            print(f"'{package_name}' was not installed. No action taken.")
        else:
            print(f"Warning: 'pip uninstall {package_name}' may have encountered an issue (return code: {process.returncode}).")
            if process.stdout: print("Uninstall stdout:\n", process.stdout)
            if process.stderr: print("Uninstall stderr:\n", process.stderr)
            print("Continuing with installation...")
    except Exception as e:
        print(f"An error occurred while trying to uninstall '{package_name}': {e}")
        print("Continuing with installation...")


def download_file(url, destination_path):
    """Downloads a file from a URL to a destination path."""
    print(f"Downloading {url} to {destination_path}...")
    try:
        urllib.request.urlretrieve(url, destination_path)
        print("Download complete.")
    except Exception as e:
        print(f"Error downloading file: {e}")
        raise

def extract_tar_gz(tar_gz_path, extract_to_path):
    """Extracts a .tar.gz file, handling the filter argument for Python 3.12+."""
    print(f"Extracting {tar_gz_path} to {extract_to_path}...")
    try:
        with tarfile.open(tar_gz_path, "r:gz") as tar:
            # Check if the 'filter' argument is supported (Python 3.12+)
            if hasattr(tarfile, 'data_filter') and sys.version_info >= (3, 12):
                print("Using 'data' filter for tar extraction (Python 3.12+).")
                tar.extractall(path=extract_to_path, filter='data')
            else:
                # For older Python versions, or if data_filter is somehow not available
                # This will raise DeprecationWarning on 3.12/3.13 if filter is not specified
                # but we handle it by checking sys.version_info.
                # If on 3.12+ and data_filter is missing (unlikely), it would warn.
                # If on <3.12, it's the old behavior.
                if sys.version_info >= (3,12) and not hasattr(tarfile, 'data_filter'):
                     warnings.warn(
                        "tarfile.data_filter not found on Python 3.12+. "
                        "Falling back to default extraction which may be unsafe "
                        "and will raise DeprecationWarning.",
                        DeprecationWarning
                    )
                tar.extractall(path=extract_to_path)
        print("Extraction complete.")
    except Exception as e:
        print(f"Error extracting file: {e}")
        raise

def patch_setup_py(setup_py_path):
    """Patches the setup.py file by replacing the read_version function."""
    print(f"Patching {setup_py_path}...")
    try:
        content = setup_py_path.read_text(encoding="utf-8")
        
        if OLD_FUNCTION_TEXT not in content:
            if NEW_FUNCTION_TEXT in content:
                print("It appears setup.py is already patched or uses the new function definition. Skipping patch.")
                return True
            else:
                print(f"Error: Could not find the old function definition in {setup_py_path}.")
                print("The setup.py structure might have changed, or this script needs adjustment.")
                return False

        modified_content = content.replace(OLD_FUNCTION_TEXT, NEW_FUNCTION_TEXT)
        
        if modified_content == content:
            print(f"Error: Patching did not change the content of {setup_py_path}. This is unexpected.")
            return False

        setup_py_path.write_text(modified_content, encoding="utf-8")
        print("Patching successful.")
        return True
    except Exception as e:
        print(f"Error patching {setup_py_path}: {e}")
        raise

def install_package(package_source_dir):
    """Installs the package using pip from the source directory."""
    print(f"Installing package from {package_source_dir} using pip...")
    pip_command = [sys.executable, "-m", "pip", "install", "."]
    
    try:
        process = subprocess.run(pip_command, cwd=package_source_dir, check=True, capture_output=True, text=True)
        print("Installation successful.")
        print("Pip output:\n", process.stdout)
        if process.stderr:
            print("Pip errors/warnings (if any):\n", process.stderr)
    except subprocess.CalledProcessError as e:
        print(f"Error during pip installation:")
        print(f"Command: {' '.join(e.cmd)}")
        print(f"Return code: {e.returncode}")
        print(f"Stdout:\n{e.stdout}")
        print(f"Stderr:\n{e.stderr}")
        raise
    except Exception as e:
        print(f"An unexpected error occurred during pip installation: {e}")
        raise

def main():
    """Main function to download, patch, and install openai-whisper."""
    
    # 0. Uninstall any existing version
    uninstall_existing_package(PACKAGE_NAME)

    with tempfile.TemporaryDirectory() as temp_dir_str:
        temp_dir = Path(temp_dir_str)
        print(f"Created temporary directory: {temp_dir}")

        downloaded_tar_path = temp_dir / PACKAGE_FILENAME
        
        download_file(PACKAGE_URL, downloaded_tar_path)
        extract_tar_gz(downloaded_tar_path, temp_dir)
        
        source_code_dir = temp_dir / EXTRACTED_DIR_NAME
        if not source_code_dir.is_dir():
            print(f"Error: Expected extracted directory {source_code_dir} not found.")
            print(f"Contents of temp_dir: {list(temp_dir.iterdir())}")
            return

        setup_py_file_path = source_code_dir / SETUP_PY_FILENAME
        if not setup_py_file_path.is_file():
            print(f"Error: {SETUP_PY_FILENAME} not found in {source_code_dir}.")
            return
            
        if not patch_setup_py(setup_py_file_path):
            print("Aborting installation due to patching failure.")
            return

        install_package(source_code_dir)
        
        print("\n---")
        print(f"Patched {PACKAGE_NAME} installation process complete.")
        print("You should now be able to import and use 'whisper'.")
        print(f"To uninstall in the future, you can use: pip uninstall {PACKAGE_NAME}")
        print("---")

if __name__ == "__main__":
    if sys.version_info < (3, 9):
        print("Whisper requires Python 3.9 or newer to run.")
        sys.exit(1)
    
    print(f"Starting {PACKAGE_NAME} patch and installation script for Python 3.13 compatibility...")
    print(f"This script will download, patch, and install {PACKAGE_NAME} version {PACKAGE_VERSION}.")
    print("Please ensure you have an internet connection and pip is functional.")
    print("---")
    
    try:
        main()
    except Exception as e:
        print(f"\nAn overall error occurred: {e}")
        print("Installation may have failed. Please check the messages above.")
    finally:
        print("---")
        print("Script finished.")

It then delivers a Whisper install in your site-packages that you can hack on.
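A quick post-install sanity check (just a sketch; the version string should match whatever sdist the script installed, 20240930 here):

import torch
import whisper

print("whisper", whisper.__version__)    # expect 20240930 for this sdist
print("torch", torch.__version__, "| cuda:", torch.cuda.is_available())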


It was reported that Linux/Triton might need this patch I whittled down, but I didn’t need it:

--- a/whisper/triton_ops.py
+++ b/whisper/triton_ops.py
@@ -90,7 +90,13 @@ def kernel(
             ]
         ),
     )
-    kernel.src = kernel.src.replace("MIDDLE_ROW_HERE", f"row{filter_width // 2}")
+
+    kernel.src = kernel.src.replace("MIDDLE_ROW_HERE", f"row{filter_width // 2}")
+
+    if hasattr(kernel, "_unsafe_update_src") is True:
+        kernel._unsafe_update_src(kernel.src)
+        kernel.hash = None
+
     return kernel

Then it’s time to test: a 1.6GB distilled Whisper large-v3.5, converted to openai-whisper format. Run it on your 2GB cheapo NVIDIA GPU.

"""whisper demonstration of distilled turbo 1.6GB
(do not run initial download in IDLE due to downloader's bad display code)
"""
import torch
import whisper
from huggingface_hub import hf_hub_download

# for more support of pipeline, etc
#from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
#from datasets import load_dataset

NVIDIA_GPU = True

if NVIDIA_GPU and torch.cuda.is_available():
    target_device_str = "cuda:0"
    torch_dtype = torch.float16
    print(f"CUDA is available. Attempting to load model on GPU ({target_device_str})...")
else:
    target_device_str = "cpu"
    torch_dtype = torch.float32
    print("CUDA not available or NVIDIA_GPU set to False. Loading model on CPU...")
    NVIDIA_GPU = False # Ensure flag is correct if CUDA isn't available
device = torch.device(target_device_str)

model = None  # guard, can check instantiation instead of later tracebacks

model_id = "distil-whisper/distil-large-v3.5-openai"
model_path = hf_hub_download(
    repo_id=model_id,
    filename="model.bin")

model = whisper.load_model(model_path)
model.to(device)
print(f"Whisper model loaded on device: {next(model.parameters()).device}")

audio_file = "audio1.mp3"
sample = whisper.load_audio(audio_file)
sample = whisper.pad_or_trim(sample)
result = model.transcribe(sample, language="en")
print(result["text"])
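If you also want the audio embedding from this distilled checkpoint, the encoder mean-pool trick from earlier in the thread should carry over, since (as I understand it) the distil-whisper models keep the full large-v3 encoder. An untested sketch, reusing model and sample from the script above:

import whisper
import torch
import torch.nn.functional as F

# Mean-pool the encoder output over time, then L2-normalize (same recipe as the earlier posts).
mel = whisper.log_mel_spectrogram(sample, n_mels=model.dims.n_mels).to(model.device)

with torch.no_grad():
    frames = model.encoder(mel.unsqueeze(0))        # (1, n_frames, d_state)
    emb = F.normalize(frames.mean(dim=1), p=2, dim=-1)

print("embedding", emb.shape)                        # expect (1, 1280)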

Just more to verify “latest”; I’m running the latest numpy:

>>> numpy.__version__
'2.2.5'

There are plenty of other packages that hate the latest NumPy, but this stack is not one of them.

3 Likes

I was getting these weird ABI issues when loading binaries directly, which resulted in mismatches, but I will have to check this out, as I’d rather run the latest versions if possible.

1 Like