A speech recording was transcribed (ASR) through the OpenAI API using the openai.Audio.transcribe() method, giving a WER of 9%.
The same audio was processed with the open-source Whisper library, using the whisper-large-v2 model (the latest model, as stated) and the model.transcribe() method, and the result was a WER of 25%!
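For reference, the two calls look roughly like this (the file name and options are illustrative):

```python
import openai   # openai Python SDK < 1.0, which provides openai.Audio
import whisper  # open-source whisper package

# Hosted Audio API: the model is selected as "whisper-1"
with open("speech.wav", "rb") as f:  # file name is illustrative
    api_result = openai.Audio.transcribe("whisper-1", f)
print(api_result["text"])

# Local open-source Whisper, loading large-v2 explicitly
model = whisper.load_model("large-v2")
local_result = model.transcribe("speech.wav")
print(local_result["text"])
```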
What is the difference? According to the API documentation, the OpenAI Audio API uses exactly the same whisper-large-v2 model. Is there any prompt engineering applied in the standard OpenAI API method?
How can I reproduce the same model/results from the OpenAI API using the open-source Whisper library?
It could be multiple things. The main one may be that the OpenAI API loads the model with different parameters, i.e. anything that affects preprocessing or decoding accuracy. I obviously do not know how exactly the requests are processed, so this is just a guess.
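For example, the open-source whisper package lets you set a number of decoding options that can noticeably change accuracy; the hosted endpoint's internal settings are not published, so the values below are only illustrative:

```python
import whisper

model = whisper.load_model("large-v2")

# Decoding options of the open-source package that commonly affect WER.
# The values below are illustrative (mostly the whisper CLI defaults), not
# the unpublished settings used by the hosted API.
result = model.transcribe(
    "speech.wav",                     # file name is illustrative
    language="en",                    # skip automatic language detection
    temperature=0.0,                  # fallback temperatures still apply internally
    beam_size=5,                      # beam search at temperature 0
    best_of=5,                        # candidates when falling back to sampling
    condition_on_previous_text=True,  # carry context across 30-second windows
    initial_prompt=None,              # optional text to bias the vocabulary
    fp16=True,                        # half precision (set False on CPU)
)
print(result["text"])
```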
If you call the Speech-to-Text API, you will get a result from Whisper large-v2. From the documentation:
The Audio API provides two speech to text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model.
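For completeness, the hosted endpoint only exposes a handful of request parameters (prompt, language, temperature, response_format); a minimal sketch of a call setting them explicitly, with an illustrative file name, could look like this:

```python
import openai

# The hosted endpoint exposes only a few request parameters; everything else
# (decoding strategy, chunking, etc.) is handled server-side.
with open("speech.wav", "rb") as f:  # file name is illustrative
    result = openai.Audio.transcribe(
        model="whisper-1",           # hosted Whisper large-v2
        file=f,
        language="en",               # optional ISO-639-1 hint
        prompt="",                   # optional text to guide transcription style
        temperature=0,               # sampling temperature
        response_format="json",
    )
print(result["text"])
```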