Best multilingual speech to text models for CPU

I’m working on a FastAPI app for transcribing videos from different platforms. I need a speech-to-text model that:

  • Supports multiple languages
  • Works well on CPU (no GPU available)
  • Has fast transcription speed

I’ve tried Whisper (base and tiny), but I’m looking for suggestions from this community for the best CPU-friendly, multilingual model — either OpenAI’s or open-source.

Any suggestions or benchmarks would really help. Thanks!

Hello and welcome to the community!
For English (yes, I know you mentioned multilingual, but still), faster-whisper with large-v3 (or large-v3-turbo for speed) still reigns supreme when you weigh accuracy and latency together. There’s also WhisperX for timestamping and diarization.

Now, for multilingual, your best bets are wav2vec2 variants, notably XLS-R. Admittedly, I feel like Meta (or someone else) released something with pretty good multilingual capability whose name I can’t remember, but overall, most of the OSS models out there are variants of either Whisper or wav2vec2/XLS-R.
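One caveat with XLS-R: the base model is a pretrained encoder, so for transcription you need a checkpoint fine-tuned for your target language. A hedged sketch via the transformers ASR pipeline, assuming `pip install transformers torch` and using `facebook/wav2vec2-large-xlsr-53-spanish` purely as an example checkpoint (swap in whatever language you need):

```python
import numpy as np
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-xlsr-53-spanish",  # example checkpoint
    device=-1,  # -1 = CPU
)

# One second of 16 kHz silence as a stand-in for real mono audio;
# a raw ndarray is assumed to already be at the model's sampling rate.
waveform = np.zeros(16_000, dtype=np.float32)
result = asr(waveform)
print(result["text"])
```

You can also pass a file path instead of a raw array, and the pipeline will handle decoding and resampling for you.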

Nowadays there’s a lot less development of purely specialized models like these, which is a real shame, but not really much we can do about it.

Most of the work that’s out there is now being done for multimodal llm input/output. Which is fine, but comes with its own set of problems and frustrations. I’m a firm believer in the advantages of architecting the pipeline yourself, especially when, for example, you want to run some models on CPU like you mentioned, or have complete control over voice output. Plus, modularity just makes life easier.

That’s the best I can offer given what’s currently available. That said, I haven’t looked closely at inference quality on CPU, but like GPUs, that’s going to depend on what kind of CPU you have and how much system RAM.

I did a deep dive into this recently because I was looking for an STT model tuned to output phonemes rather than English text. Sadly, accuracy rates for those aren’t ideal (<80%), and that seems to be a domain better suited to multimodal models at this time, unfortunately. Although who knows, I might spin up a few models to try out anyway.
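For anyone curious, here’s a rough sketch of phoneme-level output with `facebook/wav2vec2-lv-60-espeak-cv-ft`, a wav2vec2 checkpoint trained to emit IPA phonemes via CTC. As noted above, accuracy is middling, so treat this as a starting point; the silent waveform is again just a placeholder:

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

ckpt = "facebook/wav2vec2-lv-60-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)

# Placeholder: one second of 16 kHz silence; use a real mono waveform here.
speech = np.zeros(16_000, dtype=np.float32)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode to IPA phoneme strings.
ids = torch.argmax(logits, dim=-1)
phonemes = processor.batch_decode(ids)
print(phonemes)
```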


Thanks for sharing this info! I’ll try these models.

Good luck working with the models, and keep us posted.
