I’m writing to you because I’ve noticed something that I hadn’t noticed before concerning the token count. I noticed that the number of tokens counted was 70% to 100% higher than the number of words generated in French. This isn’t the case in English, where it’s around 25% to 30%. While I can understand that there is a difference between the different languages, I find such a discrepancy surprising. Have you observed the same thing yourself?
Thanks in advance.
Hello to you.
This is a reasonable conclusion that languages with less representation in training data would have a less robust dictionary of native words, needing then to be formed by smaller particles of words.
- English - Approximately 1.35 billion speakers
- Mandarin Chinese - Approximately 1.12 billion speakers
- Hindi - Approximately 600 million speakers
- Spanish - Approximately 460 million speakers
- French - Approximately 280 million speakers
translated to English (the more reliable direction):
500+ => 334 English
The token are based on a long training process to find the most efficient compression method for a corpus. This dictionary is 100000 tokens, not enough for all the world’s languages to have a dictionary of every word variation of every capitalization, accent, sentence placement, etc.
Do you have an example of this?
If you don’t directly have a tokenizer you can use this one Tiktoken Web Interface cl100k_base
indeed, the number of tokens counted in usage seems to correspond to what the tokeniser indicates. it seems that the number of tokens explodes because of the French language, as there are a lot of accents. For example, the text tested here is 159 words long for 277 tokens, which makes a ratio of +74%.
This is enormous and the increase in the number of French users on my app means that my costs are skyrocketing, even though I had calibrated my rates on the basis of 25%.
Too bad, I’ll deal with it.
Thanks for your answers
The rule of thumb for the conversion between tokens used and words represented is 0.75.
If we apply this rule to your document then we have 227 tokens used, which would on average English text encode 170.25 words, with your example being 159 words, a difference of 6.6% more, a value below the margin of error as 0.75 words per token is very approximate.
I’m not sure where your value of 74% comes from, could you explain your calculation method?
As you can see, there are 277 tokens and not 227…
So, if I calculate using your method, it’s 0.57 words per token.
In my example, I simply calculated the variation from 159 to 277 as a percentage, which represents an increase of 74.214% of 159.
In my application, my rates and margin are calculated taking into account a ratio of 0.75 words per token, i.e. a variation of around 30%. At 0.57, I’m ruining my margin. What’s more, on certain queries I get a more disadvantageous conversion.
I hope my explanation is clear enough.
Ahh, ok, 277. So that should give you ~208 words, but you are only getting 159, given the approximate nature of tokenisation, I’d still say that’s within expectations.
Some languages will have more uncommon sequences in them. For 100 average tokens you expect 75 words encoded, you seem to be getting ~20% less in that example. It could be better or worse in others, I think Spanish actually came out better than English in encoding size. But I would not call 20% additional an explosion.
It all depends on your point of view. If i’m charged 1740 tokens instead of 1330 tokens to generate 1000 words, that represents an increase in costs of almost 31%. I’d call that a huge increase, especially when the volumes start to get high.
Especially as the mass of free users is cutting into my margin even more. In short, I’m not here to complain. I just wanted to know if this is normal or if it’s an anomaly concerning the French language.
In any case, I’d like to thank you for your answers and the time you’ve given me.
Low-resource languages will require more tokens per word than high-resource languages.
When developing the tokenizer model, the training data is byte-pair encoded with the goal of minimizing the number of total tokens across the data.
Since English is the most represented language in the training data (93% of the words in the training data for GPT-3 were in English), the tokenizer will be optimized for English.
Imagine a worst-case scenario of a language where the only tokens were individual characters, you might have a 4 or 5 token/word average. On the other extreme, you could have a language where every word was represented and you’d have a 1:1 relationship between tokens and words.
It just so happens that English, being so disproportionately represented in the training data is close to 1:1 (generally about 4 tokens for 3 words), from your data it appears French is closer to 7:4.
So, in this case it seems French would require about 31% more tokens-per-word, on average, than English does.
French, though, will be a relatively high-resource language compared to others, say Czech or Tamil. So, it could be much worse.
So basically you are assuming that openai hates the french? jk
Maybe using chinese would be cheaper. I mean single representation of characters wouldn’ that mean 1 token per word?
I neither tried nor do I speak chinese. But would be curious to know.
In theory a symbolic language like Chinese would be ideal for tokenisation, but I’m not sure what effect simplified Chinese has on this, my understanding is this is a more character based, or word part based system and so would be worse, but still better than most western languages if the tokenizer was to be optimised for those languages specifically.
You have to remember that
cl100k_base has a dictionary length of 100,257 tokens, I’m not sure how many are Chinese characters…
approximately 50k for the full set and 8105 in the simplified set.
8k seems small enough to have been encoded fully.
Yes, they could. But that would be a huge percentage of tokens dedicated to a fairly low-resource language. I think Chinese has on the order of 2,000 tokens dedicated to it.
Here’s a detailed analysis on the tokenizer done by Yennie Jun,
interesting, there are certainly some single token encodes but most look like 2 or 3