We’re encountering a very odd problem: a Whisper transcription of English speech is translated (accurately) into Malay. This happens sporadically and is very hard to reproduce; however, we’ve had multiple users flag the problem (none of the users actually speaks Malay).
Anything you can see similar in the files of the ones that translate to Malay? I’m thinking maybe it’s a word or phrase in the beginning that makes it think it’s Malay?
Thanks, Paul. Yes, some of the input files include food items that could potentially be attributed to Malay (e.g. chapati, which is of course Indian but also very common/popular in Malaysia). However, that’s just one word in an entire sentence, and it isn’t a perfect match for Malay/Malaysia anyway.
You might try adding it to the system prompt
Can you please elaborate on this? The input language is unknown - users can speak their mother tongue - and the transcript must always return in the same language as the input file. The documentation states
The current prompting system is much more limited than our other language models and only provides limited control over the generated audio.
Yeah, the tech is “bleeding edge,” so it breaks often currently. It’ll get better as time goes on.
Maybe something as simple as “Be sure to transcribe in the dominant language…” ??? … or something similar …
Ah, okay. I’ve not done a lot of transcribing personally, just trying to offer ideas. Could just be a weird bug where those food items tip the scale too much. If you can reproduce it, OpenAI would likely want to know about it if they don’t already (help.openai.com)…
You’ve already mentioned the limitations of prompting for Whisper, but it can still be used to your advantage. One important thing to note is that a Whisper prompt is not an instruction comparable to a system prompt for a chat model; it is treated as preceding text in the style of the expected transcript. In case you haven’t tried it, here is my first recommendation:
Instead of prompting
Please transcribe the following text into English
we would use
The following is an audio transcript in English:
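In code, that prefix goes into the `prompt` parameter of the transcription call. A minimal sketch with the OpenAI Python SDK (the function name and default prefix are just illustrations):

```python
# Whisper treats the prompt as preceding text written in the style (and
# language) of the expected transcript, not as an instruction to obey.
PREFIX_PROMPT = "The following is an audio transcript in English:"

def transcribe(path: str, prompt: str = PREFIX_PROMPT) -> str:
    """Transcribe an audio file, steering Whisper with a same-language prefix."""
    from openai import OpenAI  # deferred so the sketch loads without the SDK installed
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt=prompt,  # a text prefix, not a system-style instruction
        )
    return result.text
```

You’d call it like `transcribe("recording.mp3")` (filename illustrative); the point is that the prompt reads as a lead-in sentence, not as a command.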
Apart from this, we do receive similar reports from time to time. For example, English-speaking users reporting transcriptions in Welsh.
A more sophisticated method could be to have another tool evaluate if the output is in the desired target language and retry if necessary. Note that for this deterministic task, an LLM would not be needed; you can use existing libraries for your programming language.
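A sketch of that check-and-retry loop, with the detector left pluggable (for example the langdetect or langid packages); `transcribe_fn` and the retry count are illustrative choices:

```python
from typing import Callable

def transcribe_with_language_check(
    transcribe_fn: Callable[[], str],        # e.g. a wrapper around the Whisper API
    detect_language: Callable[[str], str],   # returns an ISO-639-1 code, e.g. "en"
    expected_language: str = "en",
    max_retries: int = 2,
) -> str:
    """Retry transcription until the detected output language matches the target."""
    text = transcribe_fn()
    for _ in range(max_retries):
        if detect_language(text) == expected_language:
            break
        text = transcribe_fn()  # retry; you could also vary the prompt here
    return text
```

The detector runs on the transcription output, so no extra LLM call is needed for the check itself.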
If the language is known, use the language parameter, an optional parameter that can be used to increase accuracy when requesting a transcription. It should be an ISO-639-1 code (e.g. “en”).
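Passing it looks like this (a sketch; the validation helper is mine, not part of the SDK):

```python
import re

def is_iso_639_1(code: str) -> bool:
    """Cheap sanity check: ISO-639-1 codes are two lowercase letters."""
    return re.fullmatch(r"[a-z]{2}", code) is not None

def transcribe_known_language(path: str, language: str) -> str:
    """Transcribe an audio file, pinning the input language up front."""
    if not is_iso_639_1(language):
        raise ValueError(f"expected an ISO-639-1 code like 'en', got {language!r}")
    from openai import OpenAI  # deferred so the sketch loads without the SDK installed
    client = OpenAI()
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,  # e.g. "en" for English, "ms" for Malay
        ).text
```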
I guess the only way to guarantee that the correct language is transcribed is to determine the language first and then either add it to the prompt (@vb ) or use the language param.
Might be a case of a smaller / faster model being useful? With the lower costs, you might give it a few examples (your most common languages?) then ask it to classify the first X words of the transcript? Would only have to do it once and could likely keep latency down (or not too crazy…)
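Something like this, maybe — a rough sketch only: the model name, the example pairs, and both helper names are my own assumptions, not anything from the docs:

```python
def build_classifier_messages(snippet: str, examples=None) -> list[dict]:
    """Build a few-shot prompt asking a chat model to name a snippet's language.
    The default example pairs are illustrative placeholders."""
    messages = [{
        "role": "system",
        "content": "Reply with only the ISO-639-1 code of the text's language.",
    }]
    for text, code in (examples or [("Good morning everyone", "en"),
                                    ("Selamat pagi semua", "ms")]):
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": code})
    messages.append({"role": "user", "content": snippet})
    return messages

def classify_language(transcript: str, first_n_words: int = 20) -> str:
    """Ask a small chat model which language the first N words are in."""
    from openai import OpenAI  # deferred so the sketch loads without the SDK installed
    client = OpenAI()
    snippet = " ".join(transcript.split()[:first_n_words])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any small, cheap chat model would do
        messages=build_classifier_messages(snippet),
    )
    return resp.choices[0].message.content.strip()
```

You’d run the classifier once per file and then feed the result into the prompt or the language param, as suggested above.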