Best multilingual speech to text models for CPU

I’m working on a FastAPI app for transcribing videos from different platforms. I need a speech-to-text model that:

  • Supports multiple languages
  • Works well on CPU (no GPU available)
  • Has fast transcription speed

I’ve tried Whisper (base and tiny), but I’m looking for suggestions from this community for the best CPU-friendly, multilingual model — either OpenAI’s or open-source.

Any suggestions or benchmarks would really help. Thanks!

Hello and welcome to the community!
For English (yes, I know you mentioned multilingual, but still), faster-whisper with large-v3 (or large-v3-turbo for speed) still reigns supreme when you weigh accuracy and latency together. There’s also WhisperX for timestamping and diarization.

Now, for multilingual, your best bets are wav2vec2 variants, notably XLS-R. Admittedly, I feel like Meta (or someone else) released something with pretty good multilingual capability whose name I can’t remember, but overall, most of the OSS models out there are variants of either Whisper or wav2vec2/XLS-R.
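One caveat with XLS-R: the base model is a pretrained encoder, so for transcription you need a checkpoint fine-tuned for your target language. A hedged sketch via the transformers ASR pipeline, assuming `pip install transformers torch` and using `facebook/wav2vec2-large-xlsr-53-spanish` purely as an example checkpoint (swap in whatever language you need):

```python
import numpy as np
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-xlsr-53-spanish",  # example checkpoint
    device=-1,  # -1 = CPU
)

# One second of 16 kHz silence as a stand-in for real mono audio;
# a raw ndarray is assumed to already be at the model's sampling rate.
waveform = np.zeros(16_000, dtype=np.float32)
result = asr(waveform)
print(result["text"])
```

You can also pass a file path instead of a raw array, and the pipeline will handle decoding and resampling for you.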

Nowadays there’s a lot less development of purely specialized models like these, which is a real shame, but not really much we can do about it.

Most of the work that’s out there is now being done for multimodal llm input/output. Which is fine, but comes with its own set of problems and frustrations. I’m a firm believer in the advantages of architecting the pipeline yourself, especially when, for example, you want to run some models on CPU like you mentioned, or have complete control over voice output. Plus, modularity just makes life easier.

That’s the best I can offer given what’s currently available. That said, I haven’t looked closely at inference quality on CPU, but like GPUs, that’s going to depend on what kind of CPU you have and how much system RAM.

I did a deep dive into this recently because I was looking for an STT model tuned to output phonemes rather than English text. Sadly, accuracy rates for those aren’t ideal (<80%), and that seems to be a domain better suited to multimodal models at this time, unfortunately. Although who knows, I might spin up a few models to try out anyway.
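For anyone curious, here’s a rough sketch of phoneme-level output with `facebook/wav2vec2-lv-60-espeak-cv-ft`, a wav2vec2 checkpoint trained to emit IPA phonemes via CTC. As noted above, accuracy is middling, so treat this as a starting point; the silent waveform is again just a placeholder:

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

ckpt = "facebook/wav2vec2-lv-60-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)

# Placeholder: one second of 16 kHz silence; use a real mono waveform here.
speech = np.zeros(16_000, dtype=np.float32)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode to IPA phoneme strings.
ids = torch.argmax(logits, dim=-1)
phonemes = processor.batch_decode(ids)
print(phonemes)
```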


Thanks for sharing this info! I’ll try these models.

Good luck working with the models, and keep us posted.
