I’m developing a speech recognition subsystem using the OpenAI Whisper project, working in my local environment. I’ve created a Voice Activity Detection algorithm that picks up only voice and extracts clean voice data pretty easily. The only problem is that I’m saving to disk and then passing a file location to whisper.transcribe(), which obviously loads it from disk and then transcribes.
I have an NVMe drive and I’m still getting 0.5 second transcription times, but I’d obviously like something more performant than using disk space as if it were memory. I know there is a way to feed the audio directly to the transcribe function, but it doesn’t seem to like the format I’m giving it. I’m getting this error: RuntimeError: “reflection_pad1d” not implemented for ‘Short’.
It’s no biggy right now. I’m making good progress and the fix will probably be less than 20 lines of code. I’m 90% sure it’s just a format issue. But if anyone with experience of the Whisper transcribe function can point me to the proper way to pass audio data directly, that would be much appreciated.
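For reference, here’s a minimal sketch of what I expect the fix to look like, assuming the ‘Short’ error comes from handing transcribe() the raw int16 samples out of my VAD buffer (as far as I can tell, whisper’s transcribe accepts a float32 NumPy array at 16 kHz):

import numpy as np
import whisper

model = whisper.load_model("base")

def transcribe_pcm16(pcm16_bytes: bytes) -> str:
    # Interpret the VAD buffer as 16 kHz mono int16 samples.
    samples = np.frombuffer(pcm16_bytes, dtype=np.int16)
    # Convert to float32 in [-1.0, 1.0]; passing int16 directly is what raises
    # the "reflection_pad1d" not implemented for 'Short' error.
    audio = samples.astype(np.float32) / 32768.0
    # Hand the array straight to transcribe(), no file on disk involved.
    result = model.transcribe(audio)
    return result["text"]

transcribe_pcm16 and pcm16_bytes are just placeholder names for whatever my VAD hands over.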
Are you saying 500ms from the time you detect the speech to the time you start to make the API call, or from that point to when you get a transcription back?
And is this a local instance or an OpenAI Whisper API call?
I’m running a local instance of this code. It downloaded a model the first run, so my assumption is that it is running on my own computer. An assumption that you have me rethinking xD.
EDIT:
Ok, just to confirm, no it is not using the RESTful API. I am indeed running it locally using the 1GB base model. So 500ms to load from disk, transcribe, and print the text.
That does not seem unreasonable for the entire cycle. SSDs use sophisticated caching algorithms, and as the data has only just been written it will most likely be available for retrieval, if not from RAM then from high-speed cache memory. So in my (admittedly not there in front of it) opinion, the SSD portion of your cycle is perhaps 1–5% of the total time; the majority will be spent inside the Whisper engine doing the decoding and manipulation of megabytes of data.
Yeah that’s true, but I intend to have it running constantly in the background and I don’t want to wear out the NVMe. Better to keep it in memory, and NVMe is something like six times slower than RAM anyway. Additionally, this will probably need to scale beyond a single desktop to many users. It’s really meant to be a service running in the cloud or on some sort of workload compute server on-prem.
Like I said, it’s not super important right now, but I will need to fix it eventually. So I was wondering if anyone knew what format/dimensions the data has to be in.
Oh sorry, to be clear: it is already working and it transcribes perfectly. It’s just doing so with very bad engineering practices. This post was more about taking advantage of the direct audio data passing available in the Whisper project.
Understood, yep, you want people who may have already fed the service with streaming data to pass on their experience. OK, well I’ve not done that so far… it’s a new one on me as well. If I get time I’ll ask around and see if anyone has.
See the WhisperModel’s transcribe definition in transcribe.py:
audio: Union[str, BinaryIO, np.ndarray],
You can pass a string filename, BytesIO(some buffer), or np.ndarray
The function decode_audio from faster_whisper.audio takes whatever audio format you want and returns an np.ndarray. In practice I find that the bare minimum is a WAV with the 44-byte RIFF header.
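Roughly like this, as an untested sketch assuming you’re on the faster-whisper package (the function names here are made up, and “base” is just an example model size):

from io import BytesIO
from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio

model = WhisperModel("base")

def transcribe_wav_bytes(wav_bytes: bytes) -> str:
    # transcribe() accepts a file-like object, so an in-memory WAV
    # (44-byte RIFF header plus PCM data) never has to touch disk.
    segments, _info = model.transcribe(BytesIO(wav_bytes))
    return "".join(segment.text for segment in segments)

def transcribe_wav_bytes_via_decode(wav_bytes: bytes) -> str:
    # Alternatively, decode to a 16 kHz float32 np.ndarray first
    # and hand transcribe() the array.
    audio = decode_audio(BytesIO(wav_bytes), sampling_rate=16000)
    segments, _info = model.transcribe(audio)
    return "".join(segment.text for segment in segments)

The point is just that a BytesIO or an ndarray should work wherever you were passing a filename.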