All languages are NOT created (tokenized) equal

FYI

For those with an interest in using LLMs and in understanding how tokenization varies across different spoken languages, the following blog is of interest.

All languages are NOT created (tokenized) equal

Also see the related online app.

The purpose of this project is to compare the tokenization length for different languages. For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g. try English vs. Burmese). This is part of a larger project of measuring inequality in NLP.


Note:

The OpenAI tokenizer only demonstrates the following tokenizers:

  • GPT-3
  • Codex

The app from the blog states that it is using the OpenAI GPT-4 tokenizer, which is available from the tiktoken GitHub repository.
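
If you want to check counts outside the app, a minimal sketch along these lines should work (assuming the `tiktoken` package is installed; the sample sentence is just for illustration):

```python
# Minimal sketch: count tokens locally with the GPT-4 tokenizer via tiktoken.
import tiktoken

# encoding_for_model("gpt-4") resolves to the cl100k_base encoding.
enc = tiktoken.encoding_for_model("gpt-4")

text = "The quick brown fox jumps over the lazy dog"
token_ids = enc.encode(text)

print(len(token_ids))          # number of tokens for this string
print(enc.decode(token_ids))   # decodes back to the original text
```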


Interesting!

I’m wondering how much this has to do with how the words translate to English. I’m thinking that the closer you get to English, the fewer tokens you have to use, but I’m also curious how this is affected by words with multiple meanings.

Here are a few examples of Danish words that require extra context/tokens to be properly understood by an LLM:


Did you see that with the app you can

  1. Select different tokenizers


  2. Select different languages for comparison


One downside of the app is that I don’t see a way to change the data source. There is an option to randomly sample the data source.


No, I didn’t see that, but I will definitely play with it later :heart:

I think it could be interesting to figure out which language results in the fewest tokens used, although I’m expecting the answer to be English.

Not so fast.

For most general-usage tokenizers, where the training data is mostly English, I would agree. But I suspect that in private, research, or non-English settings, tokenizers might be crafted for a language other than English and give better results.
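
As a rough sketch of what that crafting could look like, the Hugging Face tokenizers library lets you train a byte-level BPE tokenizer on a corpus in the target language (the corpus lines and vocabulary size below are hypothetical placeholders, not anything from the blog):

```python
# Rough sketch: train a language-specific byte-level BPE tokenizer with the
# Hugging Face "tokenizers" library. Corpus and settings are placeholders.
from tokenizers import ByteLevelBPETokenizer

# In practice this would be an iterator over a large corpus in the target
# language; two Icelandic-style placeholder lines stand in for it here.
corpus = [
    "Hinn fljóti brúni refur hoppar yfir lata hundinn",
    "Hinn lati hundur horfir á refinn",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=1)

encoding = tokenizer.encode("Hinn fljóti brúni refur hoppar yfir lata hundinn")
print(len(encoding.ids), encoding.tokens)
```

A tokenizer trained this way spends its vocabulary on the target language’s frequent character sequences, which is exactly why its token counts could beat a mostly English general-purpose tokenizer.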

In the back of my mind I am also asking what a tokenizer for math would do, how the vectors would work, and whether such vectors could be incorporated into a general LLM based on, say, English.

Agreed,

I’m curious whether the cooperation between the Icelandic government and OpenAI will cause the number of tokens needed to tokenize Icelandic text to go down over time.

The sentence “The quick brown fox jumps over the lazy dog” tokenizes to 9 tokens in English. The Icelandic equivalent (“Hinn fljóti brúni refur hoppar yfir lata hundinn”) is currently 22 tokens.

This was done using the tokenizer on the OpenAI site (GPT-3).
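
For anyone who wants to reproduce this locally, here is a minimal sketch with tiktoken. I am assuming r50k_base is the right stand-in for the site’s GPT-3 tokenizer, so the counts should be close to the ones above but may not match exactly:

```python
# Sketch: compare English vs. Icelandic token counts with a GPT-3-era encoding.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

english = "The quick brown fox jumps over the lazy dog"
icelandic = "Hinn fljóti brúni refur hoppar yfir lata hundinn"

en_tokens = enc.encode(english)
is_tokens = enc.encode(icelandic)

print(f"English:   {len(en_tokens)} tokens")
print(f"Icelandic: {len(is_tokens)} tokens")
print(f"Ratio:     {len(is_tokens) / len(en_tokens):.1f}x")
```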
