Why is Whisper accuracy lower when using the open-source whisper package than when using the OpenAI API?

A speech recording was transcribed (ASR) with the OpenAI API (the openai.Audio.transcribe() method), giving a WER of 9%.
The same audio was then processed locally with the open-source whisper package, using whisper-large-v2 (the latest model, as stated) and the model.transcribe() method, and the result was a WER of 25%!
What explains the difference? According to the API documentation, the OpenAI Audio endpoint uses exactly the same whisper-large-v2 model. Is there any prompt engineering applied behind the standard OpenAI API method?
How can I reproduce the OpenAI API results using the open-source whisper package?
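For reference, the WER figures above were presumably computed along these lines: Levenshtein edit distance over words, normalized by reference length. This is a minimal sketch with my own edit-distance implementation; the exact tokenization and normalization used for the 9% / 25% numbers are assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that differing text normalization (casing, punctuation, number formatting) between the two outputs can inflate WER even when the underlying recognition is similar.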

Thanks in advance for your help.


Joao Paulo Lirani

It could be multiple things. The OpenAI API may load or run the model with different parameters; anything that affects preprocessing or decoding could account for the gap. I don't know exactly how requests are processed server-side, so this is just a guess.
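Building on that guess: one concrete difference worth checking is the decoding configuration. The library's model.transcribe() defaults to greedy decoding (beam_size=None), while the package's own CLI defaults to beam search. A sketch of options to try locally, assuming the gap comes from decoding settings (what the hosted service actually uses is not public):

```python
# Decoding options mirroring the openai-whisper CLI defaults rather
# than the library defaults of model.transcribe() (which are greedy).
DECODE_OPTIONS = dict(
    beam_size=5,      # beam search instead of greedy decoding
    best_of=5,        # number of candidates when sampling at temperature > 0
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule
    condition_on_previous_text=True,  # carry context across 30 s windows
)

def transcribe_local(path: str) -> str:
    # Lazy import so this sketch stays self-contained;
    # requires `pip install -U openai-whisper` and a GPU for large-v2.
    import whisper
    model = whisper.load_model("large-v2")
    return model.transcribe(path, **DECODE_OPTIONS)["text"]
```

If the audio has long silences or music, the temperature fallback and condition_on_previous_text settings in particular can change results noticeably.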


How do I use whisper-large-v2?

I think I read in their documentation that only v1 (whisper-1) is currently available. How do you know you are using large-v2?

Can you let me know? Thanks!! :slight_smile:

If you call the speech-to-text API, the response comes from Whisper large-v2.

The Audio API provides two speech to text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model.
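So the endpoint only accepts the model name whisper-1, but per the docs quoted above it is backed by the open-source large-v2 weights. A minimal sketch of the hosted call, matching the pre-1.0 openai SDK style used earlier in the thread (openai.Audio.transcribe()):

```python
# "whisper-1" is the only model name the transcription endpoint accepts;
# the docs say it is served by the open-source large-v2 model.
API_MODEL = "whisper-1"

def transcribe_hosted(path: str, prompt: str = "") -> str:
    # Lazy import; requires `pip install openai` (pre-1.0 SDK)
    # and OPENAI_API_KEY set in the environment.
    import openai
    with open(path, "rb") as audio_file:
        result = openai.Audio.transcribe(API_MODEL, audio_file, prompt=prompt)
    return result["text"]
```

The optional prompt parameter is also worth noting for the original question: it biases the decoder toward the supplied vocabulary and style, which is one legitimate way the hosted path can differ from a bare local model.transcribe() call.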