Whisper large-v3 model vs large-v2 model

I am working on a project where the objective is to transcribe audio calls in various languages into English. Our application has been using the large-v2 model, and we are considering migrating to large-v3. However, when I tested both models on a set of 20 audio files, large-v2 generally produced better output than large-v3, except in two cases where large-v3 performed better. Overall, the large-v2 transcripts are better by around 20–30%.

I am trying to understand if there’s something I might be overlooking. The large-v3 model is purported to be an improvement, yet in my experience, it seems to be the opposite.

For reference, I am using the code provided for the large-v3 model, which can be found here: huggingface.co/openai/whisper-large-v3.
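For context, here is a minimal sketch of how I am invoking the model, following the usage shown on the model card. The file name `call.wav` and the `build_generate_kwargs` helper are illustrative, not part of any library; since the goal is English output from multilingual calls, the sketch assumes `task="translate"` rather than plain transcription.

```python
# Sketch of the transformers ASR pipeline usage from the whisper-large-v3
# model card. "call.wav" and build_generate_kwargs are illustrative names.


def build_generate_kwargs(language=None, task="translate"):
    # Whisper decoding options: task="translate" emits English regardless of
    # the source language; leaving language=None lets the model auto-detect it.
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


if __name__ == "__main__":
    # Heavy imports kept inside the guard so the helper above is importable
    # without pulling in torch/transformers.
    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
    )
    result = asr("call.wav", generate_kwargs=build_generate_kwargs())
    print(result["text"])
```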


Hello @AyushSachan.

Have you managed to improve the results? The same thing is happening to me: I am getting better results with the v2 version.
I am also considering fine-tuning the model or preprocessing the audio beforehand.

I don't know if you have managed to improve the results; if so, we could discuss the changes here.
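On the audio-preprocessing idea: Whisper models expect 16 kHz mono input, so one low-effort thing to check before comparing v2 and v3 is that every call recording is downmixed and resampled consistently. Below is a minimal sketch of that normalization step; the `preprocess` function name is mine, not from any library, and it assumes SciPy is available for polyphase resampling.

```python
# Hedged sketch: downmix to mono, resample to Whisper's expected 16 kHz,
# and peak-normalize. "preprocess" is an illustrative name, not a library API.
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sample rate Whisper models expect


def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    """Return mono, 16 kHz, peak-normalized float32 audio."""
    if audio.ndim == 2:  # (samples, channels) -> mono by averaging channels
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:  # polyphase resampling with a reduced up/down ratio
        g = gcd(sr, TARGET_SR)
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    peak = np.abs(audio).max()
    if peak > 0:  # scale into [-1, 1] without clipping
        audio = audio / peak
    return audio.astype(np.float32)
```

This will not magically close a 20–30% gap, but it rules out pipeline differences (e.g. 8 kHz telephony audio being fed in unevenly) as a confound when comparing the two models.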

No, I'm still trying to figure out how I can fix it.

Has anyone run this comparison lately? Is Whisper v2 still considered better overall than Whisper v3? If so, that might be why OpenAI is still serving Whisper v2 through its public API.