All languages are NOT created (tokenized) equal

FYI

For those with an interest in using LLMs and in understanding how tokens behave across different spoken languages, the following blog post is of interest.

All languages are NOT created (tokenized) equal

Also see the related online app.

The purpose of this project is to compare the tokenization length for different languages. For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g. try English vs. Burmese). This is part of a larger project of measuring inequality in NLP.
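
For anyone who wants to try a quick comparison locally, here is a minimal sketch along those lines. It assumes the tiktoken Python package and the cl100k_base encoding; the example sentences and the resulting ratio are purely illustrative, not figures from the blog.

```python
# Minimal sketch: compare how many tokens parallel sentences use in two languages.
# Assumes the tiktoken package; cl100k_base is the encoding used by GPT-4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    """Return the number of tokens the encoding produces for `text`."""
    return len(enc.encode(text))

# Illustrative parallel sentences (not taken from the blog's data source).
english = "The weather is nice today."
spanish = "El tiempo está agradable hoy."

n_en, n_es = token_count(english), token_count(spanish)
print(f"English: {n_en} tokens, Spanish: {n_es} tokens, ratio {n_es / n_en:.2f}x")
```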


Note:

The OpenAI tokenizer page only demonstrates the following tokenizers:

  • GPT-3
  • Codex

The app from the blog states that it is using the OpenAI GPT-4 tokenizer, which is available from the tiktoken GitHub repository.
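
For reference, that encoding can be loaded locally with tiktoken (a short sketch, assuming the package is installed):

```python
# Sketch: load the GPT-4 tokenizer via tiktoken and count tokens for a string.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
print(enc.name)
print(len(enc.encode("All languages are NOT created (tokenized) equal")), "tokens")
```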


Interesting!

I’m wondering how much this has to do with how the words translate to English. I’m thinking that the closer a language is to English, the fewer tokens you have to use, but I’m also curious how this is affected by words with multiple meanings.

Here are a few examples of Danish words that require extra context/tokens to be properly understood by an LLM:


Did you see that with the app you can

  1. Select different tokenizers


  2. Select different languages for comparison


One downside of the app is that I don’t see a way to change the data source. There is an option to randomly sample from the data source.


No, I didn’t see that, but I will definitely play with it later :heart:

I think it could be interesting to figure out which language results in the fewest tokens used, although I’m expecting the answer to be English.

Not so fast.

For most general-usage tokenizers, where the training data is mostly English, I would agree. But I suspect that in private, research, or non-English-speaking settings, tokenizers might be crafted for a language other than English and give better results there.

In the back of my mind I am also asking what a tokenizer for math would do: how would the vectors work, and could such vectors be incorporated into a general LLM based on, say, English?

Agreed,

I’m curious whether the cooperation between the Icelandic government and OpenAI will cause the number of tokens needed to tokenize the Icelandic language to go down over time.

The sentence “The quick brown fox jumps over the lazy dog” tokenized in English is 9 tokens. The Icelandic sentence (“Hinn fljóti brúni refur hoppar yfir lata hundinn”) is currently 22 tokens.

This was done using the tokenizer on the OpenAI site (GPT-3).
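
If you want to reproduce that comparison locally rather than through the web page, a sketch like this should be close, assuming the site’s GPT-3 tokenizer corresponds to tiktoken’s r50k_base encoding:

```python
# Sketch: English vs. Icelandic pangram, assuming the GPT-3 tokenizer on the
# OpenAI site corresponds to tiktoken's r50k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

english = "The quick brown fox jumps over the lazy dog"
icelandic = "Hinn fljóti brúni refur hoppar yfir lata hundinn"

print("English:  ", len(enc.encode(english)), "tokens")    # 9 per the post above
print("Icelandic:", len(enc.encode(icelandic)), "tokens")  # 22 per the post above
```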


I’m interested in this topic. Have there been any updates since the last comment? I’m not sure we should charge differently for our services depending on the users’ country.

If you have a service that is unaware of the actual token consumption of users and doesn’t bill accordingly, but instead bills for the value that your service actually adds, then for under-represented languages it will indeed be a better value than if the user were to converse with the AI directly through an API key. Some languages see significant amplification of the number of tokens per character, or compared to a semantically similar message in another language, with Chinese being one of the largest disparities at 2-3 tokens per character.

Korean has robust coverage in the token dictionary, with roughly one precomposed Hangul syllable (a single Unicode character) per token, making most words 1-3 tokens. Other languages are varied.
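
To get a rough feel for those per-character differences yourself, here is a sketch (again assuming tiktoken and the cl100k_base encoding; the sample strings are just common greetings, and the exact ratios depend on the text and the encoding chosen):

```python
# Sketch: tokens per character for short samples in a few scripts.
# Assumes tiktoken's cl100k_base encoding; ratios vary with the text chosen.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Chinese": "你好，你今天过得怎么样？",
    "Korean": "안녕하세요, 오늘 어떠세요?",
}

for language, text in samples.items():
    tokens = len(enc.encode(text))
    print(f"{language:8s} {tokens:3d} tokens / {len(text):3d} chars = "
          f"{tokens / len(text):.2f} tokens per character")
```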

If you are not specializing with your chat application, you are competing with ChatGPT, where the output language and its length are not discriminated against.

Very interesting observation here:

Seems like money could be saved; I’ll call this “the Korean discount” from now on. :laughing:

If anyone knows a language that tokenizes to even fewer tokens, please tell us.