Why is Whisper accuracy lower with the open-source Whisper library than with the OpenAI API?

An audio recording of speech was transcribed with the OpenAI API (the openai.Audio.transcribe() method), yielding a WER of 9%.
The same audio was processed locally with the open-source Whisper library, using the whisper-large-v2 model (stated to be the latest) and its model.transcribe() method, and the result was a WER of 25%!
What explains the difference? According to the API documentation, the OpenAI Audio endpoint uses exactly the same whisper-large-v2 model. Is there any prompt engineering applied behind the standard OpenAI API method?
How can I reproduce the OpenAI API results using the open-source Whisper library? A sketch of the two paths I am comparing is below.
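
For reference, a minimal sketch of the two paths (the file name and API key are placeholders; this assumes the pre-1.0 openai-python client and the open-source whisper package):

```python
import openai
import whisper

# Path 1: hosted OpenAI API (pre-1.0 openai-python interface, as in the question).
openai.api_key = "sk-..."  # placeholder
with open("speech.wav", "rb") as f:  # placeholder audio path
    api_result = openai.Audio.transcribe("whisper-1", f)
print(api_result["text"])  # WER ~9% on my recording

# Path 2: local open-source Whisper library with default settings.
model = whisper.load_model("large-v2")
local_result = model.transcribe("speech.wav")
print(local_result["text"])  # WER ~25% on the same recording
```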

Thanks in advance for your help.

Cheers,

Joao Paulo Lirani

It could be several things. The OpenAI API may load the model with different parameters; anything that affects preprocessing or decoding could account for the gap. I obviously do not know exactly how requests are processed server-side, so this is just a guess.
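
For example, the open-source library exposes several decoding knobs that a hosted service could set differently. A hedged sketch of the ones most likely to move WER; the values here are things to experiment with locally, not what OpenAI actually uses:

```python
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(
    "speech.wav",                                # placeholder audio path
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule when decoding fails
    beam_size=5,                                 # beam search at temperature 0 instead of greedy
    best_of=5,                                   # candidates kept when falling back to t > 0
    condition_on_previous_text=True,             # feed prior segments as decoding context
    initial_prompt=None,                         # optional vocabulary/spelling hints
)
print(result["text"])
```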

How do I use whisper-large-v2?

I think I read in their documentation that only v1 is available currently. How do you know you are using v2?

Can you let me know? Thanks!! 🙂

If you call the speech-to-text API, you will get results from Whisper large-v2.

The Audio API provides two speech to text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model.

https://platform.openai.com/docs/guides/speech-to-text
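
As a minimal illustration (pre-1.0 openai-python client; "speech.wav" is a placeholder):

```python
import openai

# "whisper-1" is the only model name the endpoint accepts; per the docs
# quoted above, it is backed by the open-source large-v2 model.
with open("speech.wav", "rb") as audio_file:
    transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file)
print(transcript["text"])
```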