Whisper Transcription Inconsistencies and Guidance Needed for Production Use

We are currently integrating Whisper for speech transcription in our Unity-based VR game, and overall the performance has been great. However, we are running into some inconsistent behavior that is preventing us from confidently moving to production.

In some cases, Whisper returns text that is completely unrelated to the audio. A few examples we have seen:

  1. `{"text": "वा पल भॉळबा사�ा सं चर सर तककुव जानती साना टललरीster"}` (garbled mixed-script output, unrelated to the spoken audio)
  2. `{"text": ":fire::fire::fire:"}` (emoji shortcodes only)
  3. Many responses where the output is only the word "you", even though the audio clearly contains a full spoken sentence.

These issues happen randomly and are difficult to reproduce, which makes reliability a concern for live user interactions.
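One pattern we have found while debugging (a sketch, not a definitive fix): hallucinations like the examples above tend to come from clips that are silent or noise-only, so gating out near-silent audio before calling the API removes many of them. The function name and the RMS threshold below are our own assumptions and need tuning against your actual mic levels:

```python
# Sketch: skip near-silent clips before sending them to the transcription API.
# Assumes 16-bit PCM WAV input; `rms_threshold` is a tunable guess, not a
# value recommended by OpenAI.
import array
import math
import wave


def is_probably_silent(path: str, rms_threshold: float = 0.01) -> bool:
    """Return True if the 16-bit PCM WAV at `path` is near-silent."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = wf.readframes(wf.getnframes())
    samples = array.array("h", frames)
    if not samples:
        return True
    # Normalized RMS in [0, 1]: 0 is digital silence, ~0.7 is a full-scale sine.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return rms < rms_threshold
```

We call this on each captured clip and simply skip the API request (and show nothing to the user) when it returns `True`, which is cheaper than trying to detect the hallucination after the fact.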

We are using `whisper-1` through the API, and it generally works well for our use case. At the same time, we are open to trying other models if they provide better consistency.

We would really appreciate any guidance from the community or the OpenAI team on best practices for getting stable, accurate transcriptions. For example: recommended audio formats, preprocessing steps, parameter choices, or API usage patterns that help avoid these hallucinations and keep the transcription aligned with the actual speech.

Any insights would be super helpful as we are preparing to ship this in production soon.

Thanks in advance!