The purpose of this project is to compare tokenization lengths across different languages. For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g. try English vs. Burmese). This is part of a larger project of measuring inequality in NLP.
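To get a feel for the gap, here’s a rough sketch using the tiktoken library (the library and encoding are my assumptions, and the Burmese line is an approximate translation meant only as an illustration):

```python
# Rough sketch: compare token counts for a comparable message in two languages.
# Requires: pip install tiktoken. The Burmese string is an approximate
# translation of the English greeting, used only for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4 models

samples = {
    "English": "Hello, how are you?",
    "Burmese": "မင်္ဂလာပါ၊ နေကောင်းလား",
}

counts = {lang: len(enc.encode(text)) for lang, text in samples.items()}
for lang, n in counts.items():
    print(f"{lang}: {n} tokens")

print(f"Burmese/English ratio: {counts['Burmese'] / counts['English']:.1f}x")
```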
I’m wondering how much this has to do with how the words translate to English. My guess is that the closer a language is to English, the fewer tokens you have to use, but I’m also curious how this is affected by words with multiple meanings.
Here are a few examples of Danish words that require extra context/tokens to be properly understood by an LLM:
No, I didn’t see that, but I’ll definitely play with it later.
I think it could be interesting to figure out which language results in the fewest tokens, although I’m expecting the answer to be English.
For most general-purpose tokenizers, where the training data is mostly English, I would agree. But I suspect that in private, research, or non-English-speaking settings, tokenizers might be crafted for a language other than English and give better results.
In the back of my mind I’m also wondering what a tokenizer for math would do: how would the vectors work, and could such vectors be incorporated into a general LLM based on, say, English?
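As a baseline, here’s what a current general-purpose tokenizer does with a small math expression (a sketch using tiktoken’s cl100k_base encoding as a stand-in, not a purpose-built math tokenizer):

```python
# Inspect how a general-purpose BPE vocabulary splits a math expression.
# This only shows the current baseline, not a dedicated math tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
expr = "e^{i*pi} + 1 = 0 and 123456789 * 3.14159"
token_ids = enc.encode(expr)
pieces = [enc.decode([t]) for t in token_ids]
print(f"{len(token_ids)} tokens: {pieces}")
```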
I’m curious whether the cooperation between the Icelandic government and OpenAI will cause the number of tokens needed to tokenize the Icelandic language to go down over time.
The sentence "The quick brown fox jumps over the lazy dog" tokenized in English is 9 tokens. The Icelandic sentence ("Hinn fljóti brúni refur hoppar yfir lata hundinn") is currently 22.
This was done using the tokenizer on the OpenAI site (GPT-3).
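For anyone who wants to reproduce this locally, here’s a sketch with the tiktoken library; I’m assuming r50k_base matches the GPT-3 tokenizer on the site, so exact counts may differ slightly with another encoding:

```python
# Reproduce the English vs. Icelandic comparison locally.
# r50k_base is the GPT-3 encoding; the web tokenizer may differ slightly.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

english = "The quick brown fox jumps over the lazy dog"
icelandic = "Hinn fljóti brúni refur hoppar yfir lata hundinn"

print("English:  ", len(enc.encode(english)), "tokens")
print("Icelandic:", len(enc.encode(icelandic)), "tokens")
```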
I’m interested in this topic. Have there been any updates since the last comment? I’m not sure we should charge differently for our services depending on the users’ country.
If you have a service that is unaware of users’ actual token consumption and doesn’t bill accordingly, instead billing for the value your service adds, then for under-represented languages it will indeed be a better value than if the user conversed with the AI directly through an API key. Some languages see significant amplification in the number of tokens per character, or per semantically equivalent message; Chinese is one of the largest disparities at 2-3 tokens per character.
Korean has robust dictionary coverage: each precomposed hangul syllable is a single Unicode code point and typically a single token, making most words 1-3 tokens. Other languages vary.
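Here’s a small script to measure that tokens-per-character amplification (a sketch using tiktoken’s cl100k_base encoding; the Chinese and Korean lines are rough translations of the English one, and exact ratios depend on the encoding):

```python
# Measure tokens per character for semantically comparable messages.
# Sample strings are rough translations and only illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
    "Korean": "안녕하세요, 잘 지내세요?",
}

for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {n_tokens} tokens / {len(text)} chars = "
          f"{n_tokens / len(text):.2f} tokens per character")
```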
If your chat application isn’t specialized, you are competing with ChatGPT, where neither the output language nor its length is discriminated against.