OK, so today I will post the main code files, and describe how to get this working.
The quickest and easiest (and maybe cheapest) way to get this going on a GPU in the cloud, transcribing at ~15x real-time, is to pay HF the $9 a month for Pro. This allows you to clone the openai/whisper-large-v3 repo into your own ZeroGPU Space and get going “quickly” (caveats in a minute). I will also cover a free way to do this for the hobbyist who doesn’t need the insane speedup that a hefty GPU can provide. But this paid method yields your own private endpoint that will transcribe 15 to 55 hours of audio per day for a flat monthly fee.
So go to the openai/whisper-large-v3 repo on Hugging Face. Then click “Deploy” and, in the drop-down, select “Spaces”.
From here, clone the repo, and boom, you have a starting CI/CD pipeline.
It won’t work out of the box, unfortunately (this is the caveat). As an exercise you can try getting it running, and you will find all sorts of Python dependency conflicts, in particular between the NumPy and PyTorch versions you want (assuming you are shooting for the latest) and the older PyTorch that the container inherits from HF.
Good news: I spent three hours doing this dance and have magical files that will work.
So in your CI/CD, edit your “app.py” to have these contents:
import spaces
import torch
from fastapi import FastAPI, UploadFile, File, Form
from pydantic import BaseModel
import tempfile
import os
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.pipelines.audio_utils import ffmpeg_read
import yt_dlp
from fastapi.responses import PlainTextResponse

app = FastAPI()

MODEL_NAME = "openai/whisper-large-v3"

# -------------------------------------------------------------
# ZeroGPU-compatible lazy loader running on an H200 slice
# -------------------------------------------------------------
@spaces.GPU  # HF scheduler looks for this
def whisper_infer(waveform, task="transcribe"):
    """Return (text, unit_embedding) for a 16-kHz mono waveform."""
    if not hasattr(whisper_infer, "model"):
        device = "cuda"
        processor = AutoProcessor.from_pretrained(MODEL_NAME)
        model = (
            AutoModelForSpeechSeq2Seq
            .from_pretrained(MODEL_NAME, output_hidden_states=True)
            .to(device)
            .eval()
        )
        whisper_infer.processor = processor
        whisper_infer.model = model

    processor = whisper_infer.processor
    model = whisper_infer.model
    device = next(model.parameters()).device

    inputs = processor(
        waveform.squeeze(), sampling_rate=16_000, return_tensors="pt"
    ).to(device)
    input_features = inputs["input_features"]  # (B, 80, T)
    with torch.no_grad():
        # --- encoder pass for embedding ---
        enc_out = model.model.encoder(
            input_features=input_features, output_hidden_states=True
        )
        hidden = enc_out.hidden_states[-1]  # (B, T, D)
        emb = torch.nn.functional.normalize(
            hidden.mean(dim=1), p=2, dim=-1
        ).squeeze().cpu().tolist()

        # --- generate transcription ---
        gen_ids = model.generate(**inputs, task=task)  # honor "transcribe" vs "translate"
        text = processor.batch_decode(gen_ids, skip_special_tokens=True)[0]

    return text, emb
# -------------------------------------------------------------
# Helper: decode any audio file to 16-kHz mono tensor
# -------------------------------------------------------------
def decode_audio(path: str) -> torch.Tensor:
    try:
        wav, sr = torchaudio.load(path)
    except Exception:
        with open(path, "rb") as f:
            raw = f.read()
        arr = ffmpeg_read(raw, 16000)  # returns np.ndarray (T,)
        wav = torch.from_numpy(arr).unsqueeze(0)
        sr = 16000
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    return wav
# -------------------------------------------------------------
# FastAPI routes
# -------------------------------------------------------------
@app.get("/", response_class=PlainTextResponse)
def root():
    return (
        "Whisper-Large-v3 inference Space (ZeroGPU).\n"
        "POST /transcribe – multipart file 'audio', field 'task'\n"
        "POST /yt_transcribe – JSON {\"url\": …, \"task\": …}\n"
        "Swagger UI: /docs"
    )


@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...), task: str = Form("transcribe")):
    with tempfile.NamedTemporaryFile(delete=False, suffix=".audio") as tmp:
        tmp.write(await audio.read())
        tmp_path = tmp.name
    waveform = decode_audio(tmp_path)
    os.remove(tmp_path)
    text, embedding = whisper_infer(waveform, task)
    return {"text": text, "embedding": embedding}


class YTRequest(BaseModel):
    url: str
    task: str = "transcribe"


@app.post("/yt_transcribe")
def yt_transcribe(req: YTRequest):
    with tempfile.TemporaryDirectory() as tmpdir:
        fp = os.path.join(tmpdir, "yt_audio.m4a")
        yt_opts = {"format": "bestaudio/best", "outtmpl": fp, "quiet": True}
        yt_dlp.YoutubeDL(yt_opts).download([req.url])
        waveform = decode_audio(fp)
    text, embedding = whisper_infer(waveform, req.task)
    return {"text": text, "embedding": embedding}


# -------------------------------------------------------------
# Local dev entry-point (ignored by HF runtime)
# -------------------------------------------------------------
if __name__ == "__main__":
    import uvicorn

    port = int(os.environ.get("PORT", 7860))
    uvicorn.run("app:app", host="0.0.0.0", port=port, workers=1)
Also, modify your “requirements.txt” to this:
# requirements.txt (ZeroGPU)
numpy<2 # 1.26.4 wheel → ABI matches torch-2.2.*
torch==2.2.1 # leave explicit so pip doesn’t try 2.5+
torchaudio==2.2.1
transformers>=4.40.0
fastapi
uvicorn
yt-dlp
In this build, I stripped out the Gradio interface (didn’t need it, extra bloat, just running via API), but if you want it, add it back in; a rough sketch follows.
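Here is an untested sketch of how you might bolt a minimal Gradio UI onto the same FastAPI app, added at the bottom of app.py. The /ui path and the gradio_transcribe name are my own choices, and you would also need to add gradio to requirements.txt:

import gradio as gr

def gradio_transcribe(audio_path):
    # Gradio's Audio component (type="filepath") hands us a temp file path
    waveform = decode_audio(audio_path)
    text, _ = whisper_infer(waveform)
    return text

demo = gr.Interface(
    fn=gradio_transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper-Large-v3",
)

# Mount the UI onto the existing FastAPI app, served at /ui
app = gr.mount_gradio_app(app, demo, path="/ui")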
Now you need a read API key (token) for your account. A read/write token works too, but you need at least read.
Here is a driver that shows an example of transcribing a web-hosted MP3 (think S3 bucket, or wherever you host your stuff).
Note: be sure to update the script with your API key and also your username so the Space URL path is correct, and pick a real file to transcribe. TODO: use os to load the API key from an environment variable if possible (a sketch of that follows the driver).
import requests, json, os

HF_TOKEN = "hf_YOUR_API_KEY"  # repo-read scope
AUDIO_URL = (
    "https://example.com/some.mp3"
)

# 1. download the MP3 into memory
mp3_bytes = requests.get(AUDIO_URL, timeout=30).content

# 2. hit the Space
resp = requests.post(
    "https://<your-username>-whisper-large-v3.hf.space/transcribe",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    data={"task": "transcribe"},  # form field
    files={"audio": ("recording.mp3", mp3_bytes, "audio/mpeg")},  # multipart file
    timeout=300,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
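On that TODO: one way to avoid hard-coding the token is to read it from the environment at the top of the driver, something like this (HF_TOKEN here is whatever variable name you export in your shell):

import os

# Pull the token from the environment instead of hard-coding it
HF_TOKEN = os.environ.get("HF_TOKEN")
if not HF_TOKEN:
    raise RuntimeError("Set the HF_TOKEN environment variable to your HF read token")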
Embedding magic?
That is where mean-pooling (averaging) the encoder’s final hidden state at inference comes in. See the lines that contain this:
hidden = enc_out.hidden_states[-1]  # (B, T, D)
emb = torch.nn.functional.normalize(
    hidden.mean(dim=1), p=2, dim=-1
).squeeze().cpu().tolist()
This takes the encoder’s last hidden state and averages it over the time dimension. However, a simple average is not a unit vector, which is why we call the normalize function on this mean-pool. So now you have vectors where, when you do cosine similarity, you can ignore the normalization in the denominator and just focus on the multiply/accumulate (MAC) operations, which are more efficient.
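To make that concrete, here is a small illustrative helper (not part of the Space) showing that, because the embeddings come back already at unit length, cosine similarity between two /transcribe results is just a dot product:

import numpy as np

def cosine_from_unit_vectors(emb_a, emb_b):
    # Both embeddings are already L2-normalized by the Space,
    # so the dot product IS the cosine similarity (denominator = 1).
    return float(np.dot(np.asarray(emb_a), np.asarray(emb_b)))

# e.g. cosine_from_unit_vectors(resp_a.json()["embedding"],
#                               resp_b.json()["embedding"])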
So, that’s it for today. You should see the transcription (the same one you would get from the OpenAI API using Whisper), and you get a 1280-dimensional vector of floats back as well.
Note there are two endpoints here: /transcribe and /yt_transcribe. The “yt” version is for YouTube transcriptions, while the other one is for plain old audio transcriptions. I haven’t tried the YouTube path yet, so any feedback is welcome if it’s broken; an untested call sketch follows.
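For anyone who wants to try it, here is a sketch of the call, mirroring the driver above (same caveats: swap in your username, export HF_TOKEN, and the video URL is just a placeholder):

import requests, json, os

resp = requests.post(
    "https://<your-username>-whisper-large-v3.hf.space/yt_transcribe",
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
    json={"url": "https://www.youtube.com/watch?v=VIDEO_ID", "task": "transcribe"},
    timeout=600,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))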
Next time, I will talk about what to do with this vector.