Whisper (transcribe) API verbose_json results, format of language property?

In which format is the language property?

The results come back as language: 'french' or language: 'english', and I would like to convert them to the two-letter ISO-639-1 format. How can I do that?

BTW: it would be great to have the language directly in ISO-639-1.

Hi @joaquink,

The language is an optional parameter that can be used to increase accuracy when requesting a transcription. It should be in the ISO-639-1 format.

However, in the verbose transcription object response, the attribute "language" refers to the name of the detected language.

If you wish to convert the language name into the ISO-639-1 format, you can use a JSON dictionary with the language names mapped to their respective codes, or you can use the pycountry package in Python.

The problem is that I need the list of the informal names returned in the verbose transcription object response. I could guess that list, but a guess is just that. For example: will I get ‘persian’ or ‘farsi’ for the ‘fa’ ISO-639-1 code?

Meanwhile, I figured out how to get that list: by making calls that specify the language in ISO-639-1 format in the input and reading the informal name from the response. Just a few dozen calls…
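
The probing approach described above could be sketched roughly like this. Note that `transcribe` is a hypothetical stand-in, not a library function: it is assumed to send one short audio clip to the transcription endpoint with the given language code, ask for the verbose response, and return the `language` value, raising an exception when the code is rejected.

```python
from typing import Callable, Dict, Iterable


def probe_language_names(
    codes: Iterable[str],
    transcribe: Callable[[str], str],
) -> Dict[str, str]:
    """Map each accepted ISO-639-1 code to the language name the API reports.

    `transcribe` is a hypothetical helper supplied by the caller. It takes a
    code, performs one transcription request, and returns the `language`
    value from the verbose response.
    """
    names: Dict[str, str] = {}
    for code in codes:
        try:
            names[code] = transcribe(code).strip().lower()
        except Exception:
            # Assumption: a rejected code surfaces as an exception, so skip it.
            continue
    return names


if __name__ == "__main__":
    # Fake transcriber for illustration only; no network call is made.
    fake_results = {"fr": "french", "en": "english", "fa": "persian"}

    def fake_transcribe(code: str) -> str:
        return fake_results[code]  # KeyError stands in for a rejected code

    print(probe_language_names(["fr", "en", "fa", "xx"], fake_transcribe))
```

Passing the request function in as an argument keeps the loop independent of any particular client library, so it can be tried out without spending API calls.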

I upped that call count to 183 to cover the whole ISO-639-1 code space, which yielded 57 valid abbreviations and languages.

['af', 'ar', 'hy', 'az', 'be', 'bs', 'bg', 'ca', 'zh', 'hr', 'cs', 'da', 'nl', 'en', 'et', 'fi', 'fr', 'gl', 'de', 'el', 'he', 'hi', 'hu', 'is', 'id', 'it', 'ja', 'kn', 'kk', 'ko', 'lv', 'lt', 'mk', 'ms', 'mi', 'mr', 'ne', 'no', 'fa', 'pl', 'pt', 'ro', 'ru', 'sr', 'sk', 'sl', 'es', 'sw', 'sv', 'tl', 'ta', 'th', 'tr', 'uk', 'ur', 'vi', 'cy']

['afrikaans', 'arabic', 'armenian', 'azerbaijani', 'belarusian', 'bosnian', 'bulgarian', 'catalan', 'chinese', 'croatian', 'czech', 'danish', 'dutch', 'english', 'estonian', 'finnish', 'french', 'galician', 'german', 'greek', 'hebrew', 'hindi', 'hungarian', 'icelandic', 'indonesian', 'italian', 'japanese', 'kannada', 'kazakh', 'korean', 'latvian', 'lithuanian', 'macedonian', 'malay', 'maori', 'marathi', 'nepali', 'norwegian', 'persian', 'polish', 'portuguese', 'romanian', 'russian', 'serbian', 'slovak', 'slovenian', 'spanish', 'swahili', 'swedish', 'tagalog', 'tamil', 'thai', 'turkish', 'ukrainian', 'urdu', 'vietnamese', 'welsh']

So then, here is a lookup tool that works in either direction:

from typing import Optional

def iso639_lookup(lang: str, reverse: Optional[bool] = None, **junk) -> Optional[str]:
    """
    OpenAI whisper ISO-639-1 language code utility or compatibility - 2024-02

    :param lang: The language name or ISO-639-1 code to look up.
    :param reverse: If True, find the language name from the ISO-639-1 code.
                    If False or None, find the ISO-639-1 code from the language name.
    :return: The ISO-639-1 code or language name if found, otherwise None.
    """
    iso639 = {  # 57 languages supported by OpenAI whisper-1
    'afrikaans': 'af', 'arabic': 'ar', 'armenian': 'hy',
    'azerbaijani': 'az', 'belarusian': 'be', 'bosnian': 'bs',
    'bulgarian': 'bg', 'catalan': 'ca', 'chinese': 'zh',
    'croatian': 'hr', 'czech': 'cs', 'danish': 'da',
    'dutch': 'nl', 'english': 'en', 'estonian': 'et',
    'finnish': 'fi', 'french': 'fr', 'galician': 'gl',
    'german': 'de', 'greek': 'el', 'hebrew': 'he',
    'hindi': 'hi', 'hungarian': 'hu', 'icelandic': 'is',
    'indonesian': 'id', 'italian': 'it', 'japanese': 'ja',
    'kannada': 'kn', 'kazakh': 'kk', 'korean': 'ko',
    'latvian': 'lv', 'lithuanian': 'lt', 'macedonian': 'mk',
    'malay': 'ms', 'maori': 'mi', 'marathi': 'mr',
    'nepali': 'ne', 'norwegian': 'no', 'persian': 'fa',
    'polish': 'pl', 'portuguese': 'pt', 'romanian': 'ro',
    'russian': 'ru', 'serbian': 'sr', 'slovak': 'sk',
    'slovenian': 'sl', 'spanish': 'es', 'swahili': 'sw',
    'swedish': 'sv', 'tagalog': 'tl', 'tamil': 'ta',
    'thai': 'th', 'turkish': 'tr', 'ukrainian': 'uk',
    'urdu': 'ur', 'vietnamese': 'vi', 'welsh': 'cy'
    }
    normalized = lang.strip().lower()  # normalize before validating or matching
    if reverse:
        if len(normalized) != 2 or not normalized.isalpha():
            raise ValueError("ISO-639-1 abbreviation must be len=2 letters")
        # Find the dict key by searching for the value
        for language, abbreviation in iso639.items():
            if abbreviation == normalized:
                return language
        return None  # None if the code not found
    else:
        # match input style to dict format, retrieve
        return iso639.get(normalized)  # will be None for unmatched

if __name__ == "__main__":  # example
    lang = "Thai"  # your input
    reverse = False  # set to True to find the language name from a code
    iso639_out = iso639_lookup(lang, reverse)
    if iso639_out:
        print(iso639_out)
    else:
        print("No ISO-639 language match was found or returned.")
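
Since the mapping is one-to-one, the reverse direction could also be served by a precomputed inverted dictionary rather than a loop. A minimal sketch, using only a few entries from the table above for brevity; `NAME_TO_CODE`, `CODE_TO_NAME`, `to_code` and `to_name` are illustrative names and not part of the function above:

```python
from typing import Optional

# Subset of the full table, for brevity only
NAME_TO_CODE = {"english": "en", "french": "fr", "persian": "fa", "thai": "th"}

# Invert once at import time so both directions are plain dict lookups
CODE_TO_NAME = {code: name for name, code in NAME_TO_CODE.items()}


def to_code(name: str) -> Optional[str]:
    """Return the ISO-639-1 code for a language name, or None if unknown."""
    return NAME_TO_CODE.get(name.strip().lower())


def to_name(code: str) -> Optional[str]:
    """Return the language name for an ISO-639-1 code, or None if unknown."""
    return CODE_TO_NAME.get(code.strip().lower())
```

This trades the input validation of the function above for simpler code; unknown input in either direction just yields None.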

There’s a list of languages with their ISO-639-1 codes available on Wikipedia.

It’s also pretty easy to convert both ways with pycountry:

Language name to ISO-639-1 code

import pycountry

def get_iso639_1_code(language_name):
    try:
        language = pycountry.languages.get(name=language_name)
        return language.alpha_2
    except AttributeError:
        return "ISO 639-1 code not found"

# Example usage
language_name = "English"
iso639_1_code = get_iso639_1_code(language_name)
print(iso639_1_code)

ISO-639-1 code to language name

import pycountry

def get_language_name(iso639_1_code):
    try:
        language = pycountry.languages.get(alpha_2=iso639_1_code)
        return language.name
    except AttributeError:
        return "Language name not found"

# Example usage
iso639_1_code = "en"
language_name = get_language_name(iso639_1_code)
print(language_name)

It doesn’t do this though…

image

I see. In that case:

Supported languages

We currently support the following languages through both the transcriptions and translations endpoints:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
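
If you want to check your own mapping against this documented list, one option is to parse the sentence into lowercase names. This is only a sketch: `SUPPORTED_SENTENCE` and `parse_language_list` are names made up for illustration, and it assumes the API returns the lowercase form of each documented name, which appears to match the names collected earlier in this thread.

```python
import re
from typing import List

# The documented list, copied from the quoted text above
SUPPORTED_SENTENCE = (
    "Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, "
    "Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, "
    "Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, "
    "Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, "
    "Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, "
    "Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, "
    "Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, "
    "Turkish, Ukrainian, Urdu, Vietnamese, and Welsh."
)


def parse_language_list(sentence: str) -> List[str]:
    """Split a comma-separated sentence into lowercase language names."""
    parts = sentence.strip().rstrip(".").split(",")
    # Drop the "and " in front of the final item, then lowercase everything
    return [re.sub(r"^and\s+", "", p.strip()).lower() for p in parts if p.strip()]


names = parse_language_list(SUPPORTED_SENTENCE)
print(len(names), names[:3])
```

Comparing `set(names)` with the keys of your lookup table would flag any name that is documented but missing from the table, or the other way round.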

Please take a look at this Whisper tokenizer website!

Thanks! This is it.
This forum is a lot better than ChatGPT :)