Token counts for Hebrew responses seem much higher

When prompting the “gpt-3.5-turbo” model in Hebrew, I’m getting a much shorter answer for many more tokens.

For example, I was charged 450 tokens for an 88-word response, a ratio of ~0.20 words per token, while in English the ratio seems more like ~0.7.
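For reference, the words-per-token ratio can be reproduced locally with the tiktoken library. This is only a rough sketch; the sample sentences are arbitrary and exact counts depend on the encoding version:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # resolves to cl100k_base

samples = {
    "English": "The quick brown fox jumps over the lazy dog near the river.",
    "Hebrew": "השועל החום המהיר קופץ מעל הכלב העצלן ליד הנהר.",
}

for lang, text in samples.items():
    words = len(text.split())
    tokens = len(enc.encode(text))
    print(f"{lang}: {words} words / {tokens} tokens = {words / tokens:.2f} words per token")
```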

Can I do something different?

It all comes down to the tokenisation of the training set data and the numeric abundance of various letter and symbol groups. In essence, the training dataset contains text from the internet and other text data sources. If that dataset contains many occurrences of the word “orange”, then a token will be created to represent that word; other times a word part may be more numerous in usage, and so it will get its own token. For example, “apple” is 1 token because its usage is common, but “rhubarb” is 3 tokens because its usage is more sparse.
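You can check word-level counts yourself with tiktoken; a minimal sketch (the exact counts depend on the encoding in use):

```python
# Show how a common word may map to a single token while a rarer word
# is split into several sub-word pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # used by gpt-3.5-turbo / gpt-4

for word in ["apple", "rhubarb"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8", "replace") for t in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```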

Because Hebrew is sparser in the training dataset than English, Spanish, French, etc., it will tend to be assigned more tokens per word, as those symbol groups occur less often in the dataset.
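As a rough illustration, a single Hebrew word can be decomposed into its underlying tokens with tiktoken; the words below are arbitrary examples:

```python
# A Hebrew word is typically split into several sub-word or byte-level
# tokens, while a comparable common English word is often a single token.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["hello", "שלום"]:
    ids = enc.encode(word)
    print(f"{word}: {len(ids)} token(s)")
    for t in ids:
        # decode_single_token_bytes shows the raw bytes behind each token id
        print("  ", t, enc.decode_single_token_bytes(t))
```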

As the model has already been trained, there is not much that can be done if you wish to encode Hebrew words more compactly, especially if they contain letter modifiers, diacritics, and other symbols that are less used in Western text.


It’s also that the tokenizer on the website straight up lies.
I have tested it myself, and it shows far more unknown tokens than is reasonable.

And the model can reason about content that appears only as a series of unknown tokens and retrieve answers that also show up as unknown tokens, which should be impossible.

The current GPT-3.5/GPT-4 token encoder on the OpenAI website was finally updated. If you choose GPT-3, you will get wrong counts - usually higher numbers because the older dictionary is half the size.
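If you want to compare the two dictionaries yourself, tiktoken ships both encodings; a minimal sketch with an arbitrary Hebrew sample:

```python
# Compare the older GPT-3 encoding (r50k_base) with the GPT-3.5/GPT-4
# encoding (cl100k_base); the newer vocabulary is roughly twice the size,
# so the same text usually needs fewer tokens.
import tiktoken

text = "דוגמה קצרה בעברית"  # "a short example in Hebrew"

for name in ["r50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```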

What are “unknown tokens” to you? Everything an AI receives must be tokenized, and every single character has a way of being encoded - even if that means encoding a three-byte Chinese Unicode character into three byte-value tokens.
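For example, tiktoken can show the raw bytes behind each token id. Whether a given character comes out as one merged token or several byte-value tokens depends on how common it was in the training data, so the sample characters below are just illustrative:

```python
# Every character can be encoded, if necessary as raw byte-value tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for ch in ["猫", "שׁ"]:  # a Chinese character and a Hebrew letter with a point
    ids = enc.encode(ch)
    print(f"{ch!r}: {len(ch.encode('utf-8'))} UTF-8 bytes -> {len(ids)} token(s)")
    for t in ids:
        print("  ", t, enc.decode_single_token_bytes(t))
```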

I must correct myself; I just looked into it.
So the website and tiktoken are fine (though they do show non-Unicode characters at times, but do not actually use unknown tokens).

Hugging Face has a wrong implementation of it, which is fairly annoying…
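If anyone wants to check this themselves, counts from tiktoken and a Hugging Face tokenizer can be compared directly. This sketch assumes the `transformers` package and uses the openly published “gpt2” vocabulary; whether the numbers diverge depends on the text and the tokenizer versions involved:

```python
import tiktoken
from transformers import GPT2TokenizerFast

text = "שלום עולם"  # "hello world" in Hebrew, arbitrary sample

tk_ids = tiktoken.get_encoding("gpt2").encode(text)
hf_ids = GPT2TokenizerFast.from_pretrained("gpt2").encode(text)

print("tiktoken gpt2:       ", len(tk_ids), tk_ids)
print("HF GPT2TokenizerFast:", len(hf_ids), hf_ids)
```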



I completely agree; they are overcharging without any control over it.