I’m developing a speech recognition subsystem using the OpenAI Whisper project, working in my local environment. I’ve created a Voice Activity Detection algorithm that picks up only voice and extracts clean voice data pretty easily. The only problem is that I’m saving to disk and then passing a file location to whisper.transcribe(), which obviously loads it from disk and then transcribes.
I have an NVMe drive and I’m still getting 0.5 second transcription times, but I’d obviously like something more performant than using disk space as if it were memory. I know there is a way to feed the audio directly to the transcribe function, but it doesn’t seem to like the format I’m giving it. I’m getting this error: RuntimeError: “reflection_pad1d” not implemented for ‘Short’.
It’s no biggy right now. I’m making good progress and the fix will probably be less than 20 lines of code. I’m 90% sure it’s just a format issue. But if anyone with experience of the Whisper transcribe function can point me to the proper way to pass audio data directly, that would be much appreciated.
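For reference, here’s a minimal sketch of what I expect the fix to look like, assuming the ‘Short’ error comes from handing transcribe() the raw int16 samples out of my VAD buffer (as far as I can tell, whisper’s transcribe accepts a float32 NumPy array at 16 kHz):

import numpy as np
import whisper

model = whisper.load_model("base")

def transcribe_pcm16(pcm16_bytes: bytes) -> str:
    # Interpret the VAD buffer as 16 kHz mono int16 samples.
    samples = np.frombuffer(pcm16_bytes, dtype=np.int16)
    # Convert to float32 in [-1.0, 1.0]; passing int16 directly is what raises
    # the "reflection_pad1d" not implemented for 'Short' error.
    audio = samples.astype(np.float32) / 32768.0
    # Hand the array straight to transcribe(), no file on disk involved.
    result = model.transcribe(audio)
    return result["text"]

transcribe_pcm16 and pcm16_bytes are just placeholder names for whatever my VAD hands over.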
Are you saying 500ms from the time you detect the speech to the time you start to make the API call, or from that point to when you get a transcription back?
And is this a local instance or an OpenAI Whisper API call?
I’m running a local instance of this code. It downloaded a model the first run, so my assumption is that it is running on my own computer. An assumption that you have me rethinking xD.
EDIT:
Ok, just to confirm, no it is not using the RESTful API. I am indeed running it locally using the 1GB base model. So 500ms to load from disk, transcribe, and print the text.
That does not seem unreasonable for the entire cycle. SSDs use sophisticated caching algorithms, and as the data has only just been written it will most likely be available for retrieval, if not from RAM then from high-speed cache memory. So in my (admittedly not there in front of it) opinion, the SSD portion of your cycle is perhaps 1–5% of the total time; the majority will be spent inside the Whisper engine doing the decoding and manipulation of megabytes of data.
Yeah that’s true, but I intend to have it running constantly in the background and I don’t want to wear out the NVMe. Better to keep it in memory, and NVMe is something like six times slower than RAM anyway. Additionally, this will probably need to scale beyond a single desktop to many users. It’s really meant to be a service running in the cloud or on some sort of workload compute server on-prem.
Like I said, it’s not super important right now, but I will need to fix it eventually. So I was wondering if anyone knew what format/dimensions the data has to be in.
Oh sorry, to be clear: it is already working and it transcribes perfectly. It’s just doing so with very bad engineering practices. This post was more about taking advantage of the direct audio data passing available in the Whisper project.
Understood, yep, you want people who may have already fed the service with streaming data to pass on their experience. OK, well I’ve not done that so far… it’s a new one on me as well. If I get time I’ll ask around and see if anyone has.
See the WhisperModel’s transcribe definition in transcribe.py:
audio: Union[str, BinaryIO, np.ndarray],
You can pass a string filename, BytesIO(some buffer), or np.ndarray
The function decode_audio from faster_whisper.audio takes whatever audio format you want and returns an np.ndarray. In practice I find that the bare minimum is a WAV with the 44-byte RIFF header.
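Roughly like this, as an untested sketch assuming you’re on the faster-whisper package (the function names here are made up, and “base” is just an example model size):

from io import BytesIO
from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio

model = WhisperModel("base")

def transcribe_wav_bytes(wav_bytes: bytes) -> str:
    # transcribe() accepts a file-like object, so an in-memory WAV
    # (44-byte RIFF header plus PCM data) never has to touch disk.
    segments, _info = model.transcribe(BytesIO(wav_bytes))
    return "".join(segment.text for segment in segments)

def transcribe_wav_bytes_via_decode(wav_bytes: bytes) -> str:
    # Alternatively, decode to a 16 kHz float32 np.ndarray first
    # and hand transcribe() the array.
    audio = decode_audio(BytesIO(wav_bytes), sampling_rate=16000)
    segments, _info = model.transcribe(audio)
    return "".join(segment.text for segment in segments)

The point is just that a BytesIO or an ndarray should work wherever you were passing a filename.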