Whisper large-v3 model vs large-v2 model

I am currently working on a project whose objective is to transcribe audio calls in various languages into English. Until now, our application has used the large-v2 model, and we are considering migrating to large-v3. However, after testing both models on a set of 20 audio files, I found that large-v2 generally produces better output than large-v3, except in two cases where large-v3 performed better. The large-v2 transcripts are better by roughly 20–30%.
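To make the "better by 20–30%" comparison reproducible, one option is to score each model's transcript against a reference with word error rate (WER). Below is a minimal, dependency-free sketch; the example strings are hypothetical placeholders for your own reference and model transcripts (in practice a library such as jiwer also works).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical transcripts for one call, just to show the comparison.
reference = "please transfer me to the billing department"
v2_output = "please transfer me to the billing department"
v3_output = "please transfer me to billing the department now"

print(wer(reference, v2_output))  # 0.0
print(wer(reference, v3_output))  # higher WER for the worse transcript
```

Averaging the per-file WER over all 20 files for each model gives a single number to compare instead of an eyeballed percentage.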

I am trying to understand if there’s something I might be overlooking. The large-v3 model is purported to be an improvement, yet in my experience, it seems to be the opposite.

For reference, I am using the code provided for the large-v3 model, which can be found here: huggingface.co/openai/whisper-large-v3.


Hello @AyushSachan.

Have you managed to improve the results? The same thing is happening to me: I am getting better results with the v2 version.
I am also considering fine-tuning or pre-processing the audio.
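On the pre-processing idea: Whisper models expect 16 kHz mono input, while call audio is often 8 kHz telephony. A minimal sketch of resampling plus peak normalization, assuming the audio is already a mono NumPy array; the linear-interpolation resampler here is only illustrative (for production, a proper resampler such as torchaudio's or librosa's is preferable):

```python
import numpy as np

TARGET_SR = 16_000  # sample rate Whisper models expect

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    # Resample to 16 kHz via simple linear interpolation.
    if sr != TARGET_SR:
        n_out = int(round(len(audio) * TARGET_SR / sr))
        old_t = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        new_t = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    # Peak-normalize so quiet call recordings use the full range.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return audio.astype(np.float32)

# One second of a quiet 440 Hz tone at 8 kHz (typical telephony rate).
sr = 8_000
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
out = preprocess(tone, sr)
print(len(out), out.dtype)  # 16000 float32
```

Whether this actually narrows the v2/v3 gap is an open question, but it removes sample-rate and level differences as confounds when comparing the two models.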

I don’t know if you have managed to improve the results; if so, we could discuss the changes here.

No, I’m still trying to figure out how to fix it.