I have been broadcasting a podcast called Unmaking Sense on general philosophical matters for a couple of years, and there are over 300 episodes. When I got to grips with writing the API requests, I decided to have Whisper transcribe all of them by running the calls in a Python3 loop. It worked extremely well, and only cost about $25, despite some poor audio and a lot of wind noise.
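In case it is useful to anyone, the loop itself is straightforward. Here is a minimal sketch, assuming the openai-python 0.27-era `openai.Audio.transcribe` call used later in this thread; the `episodes/` and `transcripts/` directory names and the file-extension filter are just illustrative stand-ins, not what I necessarily used:

```python
import os
import time

AUDIO_DIR = "episodes"    # hypothetical input directory of podcast audio
OUT_DIR = "transcripts"   # hypothetical output directory

def transcript_path(audio_name, out_dir=OUT_DIR):
    """Map an audio file name, e.g. 'ep040.mp3', to its transcript path."""
    base, _ = os.path.splitext(audio_name)
    return os.path.join(out_dir, base + ".txt")

def transcribe_all():
    import openai  # openai-python 0.27-style API, as used later in this thread
    os.makedirs(OUT_DIR, exist_ok=True)
    for name in sorted(os.listdir(AUDIO_DIR)):
        if not name.lower().endswith((".mp3", ".m4a", ".wav")):
            continue  # skip anything that isn't audio
        with open(os.path.join(AUDIO_DIR, name), "rb") as audio_file:
            result = openai.Audio.transcribe(model="whisper-1", file=audio_file)
        with open(transcript_path(name), "w") as out:
            out.write(result["text"])
        time.sleep(1.0)  # pause between calls to stay clear of rate limits
```

Each transcript lands in its own text file, so a failed call partway through the 300-odd episodes only costs you the episodes not yet written.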
However, at least one episode - I haven’t been through them all yet, so there may be more - emerged from the process not only transcribed but apparently translated into something that at least approximated Welsh. Here are the first few lines to prove it:
Yn dod i’r episod 40 o’r series 2, rwy’n meddwl o ddrau’r series hon ar y dysgu i’r llwyddiant, ond efallai y byddai’n werth ddweud unwaith eto beth y bydd y sylfaennau, y marcau o ddysgwyr ddysgwyr ddynol, yn ymwneud â’r unigol man neu ffyrdd. Ac mae gen i 10 o ffurfion y byddwn i’n hoffi ei ddysgrifio. Y cyntaf yw bod y bobl ddysgwyr a chyfnodol yn ymddangos newid, cymdeithas, anhygoel, ddifrif a gwahanol yn ysbrydol, fel asbectau positif o’r condisiwn dynol, nid yn ymddygiadol, yn ymgyrchu a’n ysgrifennu, fel y gallwch chi ddweud, fel y gallwn ni ddim gwneud penderfyniad amdano a’n ei ddysgu, ond os y gallwn, y gallwn.
I am not a Welsh speaker, but it looked like Welsh, and Google Translate (sorry, Whisper) thought it was at least Welsh enough to have a stab at translating it back into English. It clearly isn’t very good Welsh, because it didn’t make much sense, but it made enough to convince me that it is indeed a sort of Welsh.
Can anyone explain this behaviour? I ran it again and the same thing happened. Has anyone else experienced Whisper doing this kind of thing?
And on the same Whisper topic: older versions produced more than one kind of file - there were subtitles, time-stamped segments, and at least one more in addition to the transcription. Have they been discontinued? They don’t appear in the response JSON file.
To explain the behaviour, my best guess would be that the first 30 seconds were a bit inaudible or filled with noise. Whisper detects the language from roughly the first 30 seconds of audio, so it was uncertain which language it was confronted with, made a wrong guess, and then carried on with the rest of the transcription in the wrong language.
Thank you for the reference. I will look, but I wasn’t asking for any kind of translation: I just wanted an English transcription of an audio file recorded in English. And 303 of 304 transcriptions did just that. So like you I suspect there was some kind of weird digital preface that triggered this behaviour.
Edit: Oh, I forgot for a moment that we were in the API Feedback thread - I tried to help you find a solution when you only wanted to share feedback.
Yes, that’s what I understood - thank you for clarifying; I was a bit unclear in my response:
So Whisper thought the content you provided was in Welsh (as I explained in the previous post), and then Whisper tried to transcribe the content as Welsh.
This is like using Google Assistant set to English and then speaking German: you’ll get some text out of it, but the text might not make sense. This matches what you described when you put the text into Google Translate.
So there was no translation happening - Whisper only thought that you provided Welsh content, due to the first 30 seconds of the file.
Hi linus, I read the blog you suggested, and it’s very interesting on several counts, but the model it uses is not available on the current OpenAI site, and as far as I can see the “Detecting language” option is not among the currently definable parameters.
Anyhow, I took your advice about specifying the input language using the new format as follows:
print(f"\nTranscribing file {file}.\n")
time.sleep(5.0)
with open(path + file, "rb") as audio_file:
    start_time = time.time()
    transcript = openai.Audio.transcribe(
        model = "whisper-1",
        file = audio_file,
        options = {
            "language" : "en",  # this is ISO 639-1 format as recommended in the docs
            "response_format" : "json",
            "temperature" : "0"
        }
    )
    end_time = time.time()
    timed = end_time - start_time
    contents = f"Transcription of {file} took {timed:2.3f} seconds.\nTranscript done at {time.strftime('%H:%M:%S')} on {time.strftime('%Y-%m-%d')}.\nHere is the transcript: \n{transcript['text']}"
    print("\n\n", contents)
It still came out in Welsh! I listened to the audio, and as far as I can tell it’s not remotely corrupt or indecipherable, so the mystery persists. But thank you for your help. I think I will give up now, but I may just try changing the temperature and perhaps the response_format in case that makes a difference.
Incidentally, this helped me to answer the second part of my original question: the API docs say
“The format of the transcript output [can be] in one of these options: json, text, srt, verbose_json, or vtt.” Maybe “verbose_json” will tell me something.
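For what it’s worth, `verbose_json` also bears on the earlier question about subtitle files: that response format includes the detected `language` plus timestamped `segments`, from which srt/vtt-style output can be rebuilt by hand. A small sketch, where the segment dicts are hand-written stand-ins for the real response (each assumed to carry `start`, `end`, and `text`):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Build SRT subtitle text from verbose_json-style segments
    (each a dict with 'start', 'end' and 'text')."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Of course, asking for `response_format="srt"` directly is simpler if you only want the subtitles; the helper is only useful when you want the JSON metadata and the subtitles from a single call.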
[quote="pudepiedj, post:6, topic:120780"]
It still came out in Welsh!
[/quote]
Hmm, this is really strange - at this point I’m quite puzzled why this is the case. One last idea to try in the future: append a one-minute English audio file before the audio to transcribe, to force the detected language to EN, then remove the corresponding text from the transcript afterwards. But when only 1 of 304 recordings is affected, I believe this is too much effort.
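For the record, the prepend trick could be sketched with the standard-library wave module, assuming both files are WAVs with matching channels/sample width/rate (real MP3 episodes would need converting first, e.g. with ffmpeg) and that you know the prefix clip’s exact transcript so it can be stripped off afterwards; the function names here are just illustrative:

```python
import wave

def prepend_wav(prefix_path, episode_path, out_path):
    """Concatenate a known-English prefix WAV in front of an episode WAV."""
    with wave.open(prefix_path, "rb") as pre, wave.open(episode_path, "rb") as ep:
        # (nchannels, sampwidth, framerate) must match for a naive concatenation
        if pre.getparams()[:3] != ep.getparams()[:3]:
            raise ValueError("prefix and episode audio formats differ")
        with wave.open(out_path, "wb") as out:
            out.setparams(pre.getparams())  # nframes is corrected on close
            out.writeframes(pre.readframes(pre.getnframes()))
            out.writeframes(ep.readframes(ep.getnframes()))

def strip_prefix_text(transcript, prefix_text):
    """Drop the prefix clip's known transcript from the front, if present."""
    t = transcript.lstrip()
    if t.lower().startswith(prefix_text.lower()):
        t = t[len(prefix_text):].lstrip()
    return t
```

Whisper won’t necessarily reproduce the prefix word-for-word, so the stripping step is best-effort - but since the goal is only to bias the 30-second language-detection window, even a roughly trimmed transcript would settle whether the detection was the culprit.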
Nice! Happy to hear that you got something useful out of this.