All languages are NOT created (tokenized) equal

FYI

For those with an interest in using LLMs and in understanding how tokens behave across different spoken languages, the following blog post is of interest.

All languages are NOT created (tokenized) equal

Also see the related online app.

The purpose of this project is to compare the tokenization length for different languages. For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g. try English vs. Burmese). This is part of a larger project of measuring inequality in NLP.
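
For anyone who wants to try a quick comparison locally, here is a minimal sketch along those lines. It assumes the tiktoken Python package and the cl100k_base encoding; the example sentences and the resulting ratio are purely illustrative, not figures from the blog.

```python
# Minimal sketch: compare how many tokens parallel sentences use in two languages.
# Assumes the tiktoken package; cl100k_base is the encoding used by GPT-4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    """Return the number of tokens the encoding produces for `text`."""
    return len(enc.encode(text))

# Illustrative parallel sentences (not taken from the blog's data source).
english = "The weather is nice today."
spanish = "El tiempo está agradable hoy."

n_en, n_es = token_count(english), token_count(spanish)
print(f"English: {n_en} tokens, Spanish: {n_es} tokens, ratio {n_es / n_en:.2f}x")
```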


Note:

The OpenAI tokenizer page only demonstrates the following tokenizers:

  • GPT-3
  • Codex

The app from the blog states that it is using the OpenAI GPT-4 tokenizer, which is available from the tiktoken GitHub repository.
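
For reference, that encoding can be loaded locally with tiktoken (a short sketch, assuming the package is installed):

```python
# Sketch: load the GPT-4 tokenizer via tiktoken and count tokens for a string.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
print(enc.name)
print(len(enc.encode("All languages are NOT created (tokenized) equal")), "tokens")
```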


Interesting!

I’m wondering how much this has to do with how the words translate to English. I’m thinking that the closer a language is to English, the fewer tokens you have to use, but I’m also curious how this is affected by words with multiple meanings.

Here are a few examples of Danish words that require extra context/tokens to be properly understood by an LLM:


Did you see that with the app you can

  1. Select different tokenizers


  2. Select different languages for comparison


One downside of the app is that I don’t see a way to change the data source. There is an option to randomly sample from the data source.


No, I didn’t see that, but I will definitely play with it later :heart:

I think it could be interesting to figure out which language results in the fewest tokens used, although I’m expecting the answer to be English.

Not so fast.

For most general-usage tokenizers, where the training data is mostly English, I would agree. But I suspect that in private, research, or non-English-speaking settings, tokenizers might be crafted for a language other than English and give better results there.

In the back of my mind I am also asking what a tokenizer for math would do: how would the vectors work, and could such vectors be incorporated into a general LLM based on, say, English?

Agreed,

I’m curious whether the cooperation between the Icelandic government and OpenAI will cause the number of tokens needed to tokenize the Icelandic language to go down over time.

The sentence “The quick brown fox jumps over the lazy dog” tokenized in English is 9 tokens. The Icelandic sentence (“Hinn fljóti brúni refur hoppar yfir lata hundinn”) is currently 22 tokens.

This was done using the tokenizer on the OpenAI site (GPT-3).
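
If you want to reproduce that comparison locally rather than through the web page, a sketch like this should be close, assuming the site’s GPT-3 tokenizer corresponds to tiktoken’s r50k_base encoding:

```python
# Sketch: English vs. Icelandic pangram, assuming the GPT-3 tokenizer on the
# OpenAI site corresponds to tiktoken's r50k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

english = "The quick brown fox jumps over the lazy dog"
icelandic = "Hinn fljóti brúni refur hoppar yfir lata hundinn"

print("English:  ", len(enc.encode(english)), "tokens")    # 9 per the post above
print("Icelandic:", len(enc.encode(icelandic)), "tokens")  # 22 per the post above
```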


I’m interested in this topic. Have there been any updates since the last comment? I’m not sure we should charge differently for our services depending on the users’ country.

If you have a service that is unaware of the actual token consumption of users and doesn’t bill accordingly, but instead bills for the value that your service actually adds, then for under-represented languages it will indeed be a better value than if the user were to converse with the AI directly through an API key. Some languages see significant amplification of the number of tokens per character, or compared to a semantically similar message in another language, with Chinese being one of the largest disparities at 2-3 tokens per character.

Korean has robust coverage in the token dictionary, with roughly one precomposed Hangul syllable (a single Unicode character) per token, making most words 1-3 tokens. Other languages are varied.
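
To get a rough feel for those per-character differences yourself, here is a sketch (again assuming tiktoken and the cl100k_base encoding; the sample strings are just common greetings, and the exact ratios depend on the text and the encoding chosen):

```python
# Sketch: tokens per character for short samples in a few scripts.
# Assumes tiktoken's cl100k_base encoding; ratios vary with the text chosen.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Chinese": "你好，你今天过得怎么样？",
    "Korean": "안녕하세요, 오늘 어떠세요?",
}

for language, text in samples.items():
    tokens = len(enc.encode(text))
    print(f"{language:8s} {tokens:3d} tokens / {len(text):3d} chars = "
          f"{tokens / len(text):.2f} tokens per character")
```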

If you are not specializing with your chat application, you are competing with ChatGPT, where the output language and its length are not discriminated against.

Very interesting observation here:

Seems like money could be saved; I’ll call this “the Korean discount” from now on. :laughing:

If anyone knows a language that tokenizes to even fewer tokens, please tell us.