What is the exact behavior of defining the “language” parameter in “input_audio_transcription”? Does it only boost matching for the specified language? In our testing, even when the language is defined, other languages are still recognized and transcripts are produced in languages other than the one we specified.
The documentation only says:
The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
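For context, this is roughly how we set the parameter. A minimal sketch assuming an already-open Realtime API websocket connection `ws`; the `whisper-1` model name and the `nl` code are example values, not a recommendation:

```python
import json

async def enable_transcription(ws, language: str = "nl") -> None:
    """Send a session.update event that enables input audio transcription
    and hints the expected language as an ISO-639-1 code (e.g. "nl")."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "input_audio_transcription": {
                "model": "whisper-1",  # transcription model
                "language": language,  # the parameter in question
            }
        },
    }))
```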
We have observed that language recognition often fails when there is very little audio to work with, likely because the model is still weighing multiple candidate languages. Additionally, during real-time transcription the model does not take previous audio fragments into account; it evaluates each audio snippet in isolation. We see this clearly in our phone-based customer support solution, AnswerPal, where we use real-time transcription to respond to incoming calls immediately.
When we transcribe the entire conversation after the call concludes, we see significantly higher accuracy because the model has the full context available. It should be noted that our conversations are primarily in Dutch, French, and German, so recognition accuracy may well be better for English than for these languages.
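For that post-call pass we simply send the whole recording to the standard transcription endpoint with the same language hint. A minimal sketch using the official openai Python SDK; the file name and the nl code are placeholders for your own values:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe the complete recording in one request, so the model sees
# the full conversation as context; "nl" hints that the audio is Dutch.
with open("call_recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="nl",
    )

print(transcript.text)
```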
By explicitly defining a language tag (an ISO-639-1 code such as nl), you steer the model toward interpreting the input as that language, reducing the likelihood that it picks one of the other 98 available options. The model may still recognize other languages, but accuracy and latency improve significantly for the specified one.