Whisper API for Hindi Speech to Text

Hi,
I’m reaching out to seek assistance with an issue I’m encountering while using the Whisper API for Hindi speech-to-text transcription in my application.

Issue Description:

When transcribing short Hindi phrases consisting of 2-3 words, the Whisper API struggles to accurately capture the intended words. However, longer conversations with multiple sentences are transcribed with high precision. This inconsistency is affecting the reliability of my application, especially in scenarios where concise inputs are common.

Current Setup:

  • API Used: OpenAI Whisper API
  • Language Configuration: Set to Hindi by default (language='hi')
  • Implementation Details:
import openai

openai.api_key = 'YOUR_API_KEY'

def transcribe_audio(file_path):
    # Open the recording in binary mode and send it to the Whisper API,
    # forcing Hindi so the model does not auto-detect the language.
    with open(file_path, "rb") as audio_file:
        transcript = openai.Audio.transcribe("whisper-1", audio_file, language='hi')
        return transcript['text']

Steps Taken So Far:

  1. Language Parameter: Explicitly set the language parameter to Hindi to ensure the model prioritizes Hindi language processing.
  2. Input Variations: Tested multiple short phrases to determine if the issue persists across different inputs.
  3. Comparative Analysis: Compared transcriptions of short phrases against longer conversations to confirm the inconsistency in accuracy.
  4. Audio Quality: Verified that the audio recordings are clear and free from background noise, ruling out audio quality as a potential cause.

My Question:

Are there any recommended strategies or configurations within the Whisper API that can enhance the accuracy of transcribing short Hindi phrases? Specifically, I’m looking for ways to ensure precise word extraction for brief inputs without compromising the performance on longer conversations.

Any insights, suggestions, or best practices would be greatly appreciated.

Thank you for your time and support!

Best regards,
Shashank

Use the prompt parameter.

As the prompt, write a Hindi-language lead-up to what is spoken in the audio (the prompt field is not for instructions).

Something plausible, like the text of someone introducing a speaker from India who is about to give a presentation.
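As a rough sketch, reusing the setup from your post: the Hindi intro sentence below is just an invented placeholder, so swap in lead-up text that matches your own domain.

import openai

openai.api_key = 'YOUR_API_KEY'

# Hypothetical Hindi lead-up, in the style of someone introducing a speaker
# from India before a presentation. Replace with your own context.
HINDI_LEAD_UP = "अब भारत से आए हमारे वक्ता अपनी प्रस्तुति के बारे में कुछ शब्द कहेंगे।"

def transcribe_short_hindi(file_path):
    with open(file_path, "rb") as audio_file:
        # The prompt is prior context, not an instruction; it nudges Whisper
        # toward Hindi vocabulary and style, which helps on very short clips.
        transcript = openai.Audio.transcribe(
            "whisper-1",
            audio_file,
            language='hi',
            prompt=HINDI_LEAD_UP,
        )
    return transcript['text']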

If it turns out that only genuinely lengthy audio improves the output, you could prepend your own five seconds of preliminary speech, something that transcribes reliably and is easy to strip out of the transcript response.
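If you go that route, here is a minimal sketch, assuming pydub (with ffmpeg) is available and that hindi_intro.wav and its known transcription INTRO_TEXT are placeholders you record and verify yourself:

import openai
from pydub import AudioSegment

openai.api_key = 'YOUR_API_KEY'

# Known preliminary clip and the text Whisper reliably returns for it
# (both placeholders; use your own recording and verified transcription).
INTRO_FILE = "hindi_intro.wav"
INTRO_TEXT = "नमस्ते, अब मुख्य वाक्य सुनिए।"

def transcribe_with_intro(file_path):
    # Prepend ~5 seconds of known speech so the clip is no longer too short.
    combined = AudioSegment.from_file(INTRO_FILE) + AudioSegment.from_file(file_path)
    combined.export("combined.wav", format="wav")

    with open("combined.wav", "rb") as audio_file:
        transcript = openai.Audio.transcribe("whisper-1", audio_file, language='hi')

    text = transcript['text'].strip()
    # Strip the known intro text if it appears at the start of the result.
    if text.startswith(INTRO_TEXT):
        text = text[len(INTRO_TEXT):].strip()
    return text

The intro removal here is a simple exact-prefix match; in practice the returned wording can vary slightly, so you may want a fuzzier comparison.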

Good luck!

Hi, thanks for the input. Unfortunately this only solves part of the issue; problems still arise with different dialects and accents in Hindi speech-to-text. Is there any way to customize and train the model with our own custom data covering different dialects and accents, using Whisper or OpenAI? Thanks!