The purpose of this project is to compare the tokenization length for different languages. For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g. try English vs. Burmese). This is part of a larger project of measuring inequality in NLP.
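A quick way to see the gap is to count tokens for the same sentence in two languages. Here is a minimal sketch, assuming the open-source tiktoken package and its cl100k_base encoding; the sample sentences (including the Burmese translation) are illustrative only, not data from this project:

```python
import tiktoken

# cl100k_base is one of the publicly available tiktoken encodings
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Burmese": "မင်္ဂလာပါ၊ ဒီနေ့ နေကောင်းလား။",  # rough equivalent of the English line
}

for language, text in samples.items():
    tokens = enc.encode(text)
    print(f"{language}: {len(tokens)} tokens for {len(text)} characters")
```

Ratios like tokens per character (or per sentence) are the kind of measure this comparison is after.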
I’m wondering how much this has to do with how the words translate to English. My thinking is that the closer a language is to English, the fewer tokens you have to use, but I’m also curious how this is affected by words with multiple meanings.
Here are a few examples of Danish words that require extra context/tokens to be properly understood by an LLM:
For most general-usage tokenizers, where the training data is mostly English, I would agree. But I suspect that in private, research, or non-English-speaking settings, tokenizers might be crafted for a language other than English and give better results.
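For what that could look like in practice, here is a minimal sketch of training a language-specific BPE tokenizer with the Hugging Face tokenizers library; `danish_corpus.txt` is a hypothetical file of Danish text, not something from this project:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair-encoding tokenizer with a simple whitespace pre-tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["danish_corpus.txt"], trainer=trainer)  # hypothetical corpus

# "I really like reading books." in Danish
print(tokenizer.encode("Jeg kan godt lide at læse bøger.").tokens)
```

A vocabulary trained this way should merge frequent Danish substrings into single tokens, which is exactly where an English-centric vocabulary falls short.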
In the back of my mind I am also asking what a tokenizer for math would do, how the resulting vectors would work, and whether such vectors could be incorporated into a general LLM based on, say, English.