[Whisper] Is there a way to tell the language before recognition?

Is it possible to send the model a short sound file and have it return the language spoken for the majority of the time?
From my testing, if I know which language is spoken and pass its value as the ‘language’ option for the whole conversation (say, between 2 people), then the result is better.

Hi @content

Which model are you referring to?

Hi, we are using the medium model of Whisper; the snippet below is from the GitHub page:

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

I haven’t used Whisper yet, but looking at the code, it should be possible to take a small yet sufficient snippet of the audio and run the code above on it. max(probs, key=probs.get) returns the most likely language, and you can then run the entire audio file with:
whisper audio.flac audio.mp3 audio.wav --model medium
specifying the detected language with the --language option.
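Putting the two steps together in Python, a minimal sketch might look like the following. It assumes the openai-whisper package; "audio.mp3" is a placeholder path, and transcribe() accepts a language keyword so re-detection is skipped:

```python
def pick_language(probs: dict) -> str:
    """Return the language code with the highest probability."""
    return max(probs, key=probs.get)

def transcribe_with_detection(path: str, model_name: str = "medium") -> dict:
    """Detect the language on the first 30 seconds, then transcribe the
    whole file with that language fixed (sketch; requires openai-whisper)."""
    import whisper  # deferred so pick_language stays importable without it
    model = whisper.load_model(model_name)
    # detect_language works on a 30-second log-Mel spectrogram
    snippet = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(snippet).to(model.device)
    _, probs = model.detect_language(mel)
    # pass the detected language so the full transcription doesn't re-detect
    return model.transcribe(path, language=pick_language(probs))

# Usage (needs an actual audio file and the whisper package installed):
# result = transcribe_with_detection("audio.mp3")
# print(result["text"])
```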

The only caveat is that if the audio contains multiple languages, a single snippet may not be enough to determine the most prominent one: the language detected will be the one prominent in the snippet, not necessarily in the whole audio.
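One way to mitigate this (a sketch, not something Whisper provides out of the box) is to run detection on several snippets spread across the file and average the per-language probabilities before picking a winner:

```python
from collections import defaultdict

def aggregate_language_probs(per_snippet_probs: list[dict]) -> str:
    """Average per-language probabilities across several detection runs.

    per_snippet_probs holds dicts shaped like model.detect_language output,
    e.g. [{"en": 0.7, "fr": 0.3}, {"en": 0.4, "fr": 0.6}].
    Returns the language code with the highest average probability.
    """
    totals = defaultdict(float)
    for probs in per_snippet_probs:
        for lang, p in probs.items():
            totals[lang] += p
    n = len(per_snippet_probs)
    return max(totals, key=lambda lang: totals[lang] / n)

# Example: English dominates two of three snippets, so it wins overall
print(aggregate_language_probs(
    [{"en": 0.7, "fr": 0.3}, {"en": 0.4, "fr": 0.6}, {"en": 0.8, "fr": 0.2}]
))  # -> en
```

Each snippet would be a 30-second pad_or_trim window taken at a different offset of the file; the averaging itself is plain Python.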

I’ll run this experiment and share my findings later.


Thanks @sps, it does help. (Hmm, I have to type in random words to match the length requirement)
