[SOLVED] Whisper translates into Welsh

I have been broadcasting a podcast called Unmaking Sense on general philosophical matters for a couple of years, and there are over 300 episodes. When I got to grips with writing the API requests, I decided to get Whisper to transcribe all of them by running it in a Python3 loop. It worked extremely well and only cost about $25, despite some poor audio and a lot of wind noise.

However, one episode at least - I haven’t been through them all yet so there may be more - emerged from the process not only transcribed but apparently having been translated into something that at least approximated to Welsh. Here are the first few lines to prove it:

Yn dod i’r episod 40 o’r series 2, rwy’n meddwl o ddrau’r series hon ar y dysgu i’r llwyddiant, ond efallai y byddai’n werth ddweud unwaith eto beth y bydd y sylfaennau, y marcau o ddysgwyr ddysgwyr ddynol, yn ymwneud â’r unigol man neu ffyrdd. Ac mae gen i 10 o ffurfion y byddwn i’n hoffi ei ddysgrifio. Y cyntaf yw bod y bobl ddysgwyr a chyfnodol yn ymddangos newid, cymdeithas, anhygoel, ddifrif a gwahanol yn ysbrydol, fel asbectau positif o’r condisiwn dynol, nid yn ymddygiadol, yn ymgyrchu a’n ysgrifennu, fel y gallwch chi ddweud, fel y gallwn ni ddim gwneud penderfyniad amdano a’n ei ddysgu, ond os y gallwn, y gallwn.

I am not a Welsh speaker, but it looked like Welsh, and Google Translate (sorry, Whisper) thought it was at least Welsh enough to have a stab at translating it back into English. It clearly isn’t very good Welsh, because it didn’t make much sense, but it made enough sense to convince me that it is indeed a sort of Welsh.

Can anyone explain this behaviour? I ran it again and the same thing happened. Has anyone else experienced Whisper doing this kind of thing?

And on the same Whisper topic: older versions produced more than one kind of file; there were subtitles, time-series, and at least one more in addition to the transcription. Have they been discontinued? They don’t appear in the response json file.

2 Likes

Hi @pudepiedj,

have you tried setting the language with

Detecting language using up to the first 30 seconds. Use `--language` to specify the language

I have taken this from How to Build an OpenAI Whisper API - Deepgram Blog :zap:, thought that might help :slight_smile:

So to explain the behaviour: my best guess would be that the first 30 seconds were a bit inaudible or filled with noise. Whisper was therefore uncertain which language it was confronted with, made a wrong guess, and then carried on with the rest of the transcription in the wrong language.
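For example, if you run the open-source model yourself, forcing the language looks something like this (just a sketch using the openai-whisper package from GitHub; the file name is a placeholder):

import whisper  # the open-source package, not the API client

model = whisper.load_model("base")   # any of the available model sizes
result = model.transcribe(
    "episode.m4a",                   # placeholder file name
    language="en",                   # skip the 30-second language detection
)
print(result["text"])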

Thank you for the reference. I will look, but I wasn’t asking for any kind of translation: I just wanted an English transcription of an audio file recorded in English. And 303 of 304 transcriptions did just that. So, like you, I suspect there was some kind of weird digital preface that triggered this behaviour.

Edit: Oh, I forgot for a moment that we were in the API Feedback thread; I tried to help you find a solution when you only wanted to share feedback. :slight_smile:


Yes, that’s what I understood; thank you for clarifying. I was a bit unclear in my response:

So Whisper thought the content you provided was in Welsh (as I explained in the previous post) and then tried to transcribe the content as Welsh.

This is like using Google Assistant set to English and then speaking German. You’ll get some text out of it, but the text might not make sense. That matches what you described when you put the text into Google Translate.

So there was no translation happening; Whisper only thought that you provided Welsh content because of the first 30 seconds of the file.

1 Like

Hi linus, I read the blog you suggested, and it’s very interesting on several counts, but the model it is using is not available on the current OpenAI site, and as far as I can see the “Detecting language” option is not part of the current definable parameters.

Anyhow, I took your advice about specifying the input language using the new format as follows:

print(f"\nTranscribing file {file}.\n")
            time.sleep(5.0)
            with open(path + file, "rb") as audio_file:
                start_time = time.time()
                transcript = openai.Audio.transcribe(
                    model = "whisper-1",
                    file = audio_file,
                    options = {
                        "language" : "en",           # this is ISO 639-1 format as recommended in the docs
                        "response_format" : "json",
                        "temperature" : "0"
                        }
                    )
                end_time = time.time()
                timed = end_time - start_time
                contents = f"Transcription of {file} took {timed:2.3f} seconds.\nTranscript done at {time.strftime('%H:%M:%S')} on {time.strftime('%Y-%m-%d')}.\nHere is the transcript: \n{transcript['text']}"
                print("\n\n",contents)

It still came out in Welsh! I listened to the audio, and as far as I can tell it’s not remotely corrupt or indecipherable, so the mystery persists. But thank you for your help. I think I will give up now but I may just try changing the temperature and perhaps the response_format in case that makes a difference.

Incidentally, this helped me to answer the second part of my original question: the API docs say “The format of the transcript output [can be] in one of these options: json, text, srt, verbose_json, or vtt.” Maybe “verbose_json” will tell me something.
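Roughly what I have in mind for the next run, based on my reading of the API reference (so treat the exact parameter handling as a sketch rather than gospel; the file name is a placeholder):

import openai

with open("episode.m4a", "rb") as audio_file:       # placeholder file name
    transcript = openai.Audio.transcribe(
        model="whisper-1",
        file=audio_file,
        language="en",
        response_format="verbose_json",   # one of: json, text, srt, verbose_json, vtt
    )
print(transcript)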

1 Like

[quote="pudepiedj, post:6, topic:120780"]
It still came out in Welsh!
[/quote]

Hmm, this is really strange - at this point I’m quite puzzled about why this is the case. One last idea to try in the future: prepend a one-minute English audio file before the audio you want transcribed to force the language to EN, then remove that part from the transcript afterwards. But when only 1 of 304 recordings is affected, I believe this is too much effort :smiley:

Nice! Happy to hear that you got something useful out of this :innocent:

2 Likes

Final Chapter:
I’d managed to do some transcriptions on Google Colab using the open-source Whisper, so I thought I’d try the fateful “Welsh” episode there. It worked like a charm first time, and not only produced a nigh-on perfect English text, but four other files as well (*.json, *.vtt, *.srt, *.tsv).
I’m not sure what is going on here, and still have no solution to the “Welsh” transcription by the OpenAI API Whisper-1, but I was interested to see that the json version produced on Colab contained not just the text but all the information required to produce all the other versions. Here’s an example of the extracted json content just in case there are other newbies here:

id : 112
seek : 86528
start : 887.0
end : 889.64
text :  And that is never going to be a final state.
tokenlist:  [51449, 843, 326, 318, 1239, 1016, 284, 307, 257, 2457, 1181, 13, 51581]
temperature : 0.0
avg_logprob : -0.1457313568361344
compression_ratio : 1.4943181818181819
no_speech_prob : 0.2664724886417389

This information is invaluable for would-be audio editors because it facilitates segment-selection. I am probably the only person in the world who doesn’t - didn’t - know this, so forgive me. I am assuming that these are tiktokens and that the “no_speech_prob” as high as 26.6%, if that is what it means, may partly explain the “Welsh”. Enough!
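For anyone who wants to try the segment-selection idea, here is a minimal sketch of what I mean, assuming the *.json file written by the open-source version (the file name is made up):

import json

with open("Episode2-40.json") as f:        # hypothetical output file from the Colab run
    data = json.load(f)

# print every segment that falls between two timestamps (in seconds)
for seg in data["segments"]:
    if 880.0 <= seg["start"] <= 900.0:
        print(f"{seg['start']:8.2f} - {seg['end']:8.2f} {seg['text']}")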

1 Like

I just had this same exact issue (still happening as of this posting). Perfectly clear American English recording transcribed as Welsh (confirmed using Google Translate’s language detection). Using the language parameter to force English had no effect. I believe there was quite a bit of empty audio at the start of the file.

Update: just tried a different recording and same result – Welsh

transcript = openai.Audio.transcribe(
    model = "whisper-1",
    file = audio_file,
    options = {
        "language" : "en",
        "response_format" : "json",
        "temperature" : "0"
        }
    )

Hi Brian,
Thank you. Interesting. I will now see whether any more of mine have done this since you have had it happen to multiple files.

As I said, I eventually managed with no difficulty and no editing of the original recording by using the open-source version on Colab, but the json data at the end suggest that there is some ambiguity about the way the software is interpreting the speech because of the 26.6% “no_speech_prob”, if I am understanding that correctly. Have you looked at your json data to see whether it is registering similar values when it is translating into Welsh? Most of my “no_speech_prob” parameters were much lower, but some were as high as 94%. There is almost no silent space at the start. But if you are getting this on more than one recording, it suggests that it might be a more systemic problem.

I gather from what I take to be the original GitHub repository that the no_speech_prob is used in conjunction with the avg_logprob, according to this script, to determine whether a segment should be regarded as silent. The no_speech_prob must be above 0.6 and simultaneously the avg_logprob below -1 for the segment to be treated as silent/skipped, but as the plot shows, these conditions are never both met even though the no_speech_prob is sometimes very high. I don’t think this helps with our problem, though, and I don’t know what the default thresholds are for the OpenAI API Whisper. Anyway, it’s been an interesting and educative exploration. Thanks for your interest.
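To spell that condition out in code, this is roughly my paraphrase of the repository’s transcribe script with the default thresholds as I understand them (a sketch, not the OpenAI API’s actual behaviour):

def treated_as_silent(segment, no_speech_threshold=0.6, logprob_threshold=-1.0):
    # a segment is only skipped when BOTH conditions hold
    return (segment["no_speech_prob"] > no_speech_threshold
            and segment["avg_logprob"] < logprob_threshold)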

1 Like

Also receiving the same issue: most transcripts appear in Welsh, but I’m speaking English.

1 Like

Great insights. Thank you.

Out of curiosity, did you ever try @linus’ advice on prepending a very obvious and coherent English audio file as a sort of “primer”? Would you mind sharing the first 10 seconds of your audio file?

Could it be possible that whatever you are saying initially is taken as Welsh, and the rest simply carries on from there?

What are your initial logprobs? Is Whisper reporting a high confidence with your first n seconds of audio?
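Something like this run against the segment data from the open-source output would show it (a sketch; the file name is a placeholder and I’m assuming the same segment fields you posted earlier):

import json

with open("episode.json") as f:        # placeholder file name
    data = json.load(f)

# confidence figures for the first few segments
for seg in data["segments"][:5]:
    print(f"{seg['start']:6.1f}s  avg_logprob={seg['avg_logprob']:.3f}  no_speech_prob={seg['no_speech_prob']:.3f}")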

1 Like

Hi! I wouldn’t say “most”, but certainly some, and I have still not checked the other transcriptions so the proportion may be greater than 1/304. BrianBray01 reports the same issue.

What is more, the current OpenAI Whisper-1 doesn’t seem to produce anything like a full json or verbose_json response: there is just text; no time-series; no metadata; nothing.

The older online open source whisper produces five kinds of file and at the end of the json file there is a heap of very useful data including diagnostics (see my previous post). I am not sure what is going on here. Why are OpenAI trying to charge for something less good than what it replaces?

I think my graphs show that the early stages of the audio are fine; the problems seem to arise in the middle with very high no_speech_prob statistics, but that doesn’t explain the Welsh transcription.

Experiencing the same issue - looking forward to a solution.

Would you mind sharing your audio? Or at least the first 30 seconds? And the request that you use with it?

Late to the party here, but what happens if you chop out the first minute of the file and send the rest to Whisper? Does it snap back to English?

Hi, Curt, Ronald, Justin, acoloss, Brian and Linus,

Yes, it does snap back to English if I cut out the first 30 seconds of the file.

ffmpeg -i folder/Episode2-40-transcribed.m4a -ss 00:00:30 -c copy folder/Episode-2-40-trimmed.m4a

This produces a perfectly good English transcription of what is left. But it doesn’t explain why, given that the language was specified as ‘en’ (see the API call in an earlier post), it ignores it. And to my ears the complete audio file is perfectly clear, although of course I am not “listening” digitally.
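If I wanted to fold that workaround back into the original Python loop, it would be something along these lines (just a sketch reusing the paths from the command above):

import subprocess
import openai

source = "folder/Episode2-40-transcribed.m4a"
trimmed = "folder/Episode-2-40-trimmed.m4a"

# cut the first 30 seconds without re-encoding, then transcribe the remainder
subprocess.run(["ffmpeg", "-i", source, "-ss", "00:00:30", "-c", "copy", trimmed], check=True)

with open(trimmed, "rb") as audio_file:
    transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file)
print(transcript["text"])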

I am also really puzzled by what the current OpenAI version is doing and what the strategy governing it is. The GitHub version (a) comes with 11 options that determine the sophistication of the model; (b) produces five different file-types automatically. The OpenAI version offers no such model options and produces - as far as I can see - just the text transcription. Even “verbose_json” just produces the text transcription with no timestamp data or anything else. But maybe I am just doing something wrong …?

2 Likes

I think the API version from OpenAI is a bit dumbed down from the open source version. If you can host the open source version, and need the additional features, then do that.

Not sure why the “digital ear” in the OpenAI version is glitched, but using the OpenAI API saves on hosting hassles, which is why I use it. The Huggingface version that I was using decided to crap out one day, and a few days later the OpenAI version was released. Coincidence? I think not! But I moved over to the OpenAI version for the sake of time management.

2 Likes

Happy to share it privately with you. The audio is from a real doctor dictating a letter. The patient is not identifiable, but I would still prefer not to post this on a public forum.
How do I share privately?