Token counts for Hebrew responses seem much higher

When prompting the “gpt-3.5-turbo” model in Hebrew, I’m getting a much shorter answer for many more tokens.

For example, I was charged 450 tokens for an 88-word response, a ratio of ~0.20 words per token, while in English the ratio seems more like ~0.7.
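For reference, the words-per-token ratio can be reproduced locally with the tiktoken library. This is only a rough sketch; the sample sentences are arbitrary and exact counts depend on the encoding version:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # resolves to cl100k_base

samples = {
    "English": "The quick brown fox jumps over the lazy dog near the river.",
    "Hebrew": "השועל החום המהיר קופץ מעל הכלב העצלן ליד הנהר.",
}

for lang, text in samples.items():
    words = len(text.split())
    tokens = len(enc.encode(text))
    print(f"{lang}: {words} words / {tokens} tokens = {words / tokens:.2f} words per token")
```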

Can I do something different?

It all comes down to the tokenisation of the training set data and the numeric abundance of various letter and symbol groups. In essence, the training dataset contains text from the internet and other text data sources. If that dataset contains many occurrences of the word “orange”, then a token will be created to represent that word; other times a word part may be more numerous in usage, and so it will get its own token. For example, “apple” is 1 token because its usage is common, but “rhubarb” is 3 tokens because its usage is more sparse.
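You can check word-level counts yourself with tiktoken; a minimal sketch (the exact counts depend on the encoding in use):

```python
# Show how a common word may map to a single token while a rarer word
# is split into several sub-word pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # used by gpt-3.5-turbo / gpt-4

for word in ["apple", "rhubarb"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8", "replace") for t in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```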

Because Hebrew is sparser in the training dataset than English, Spanish, French, etc., it will tend to be assigned more tokens per word, as those symbol groups occur less often in the dataset.
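As a rough illustration, a single Hebrew word can be decomposed into its underlying tokens with tiktoken; the words below are arbitrary examples:

```python
# A Hebrew word is typically split into several sub-word or byte-level
# tokens, while a comparable common English word is often a single token.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["hello", "שלום"]:
    ids = enc.encode(word)
    print(f"{word}: {len(ids)} token(s)")
    for t in ids:
        # decode_single_token_bytes shows the raw bytes behind each token id
        print("  ", t, enc.decode_single_token_bytes(t))
```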

As the model has already been trained, there is not much that can be done if you wish to encode Hebrew words more compactly, especially if they contain letter modifiers, diacritics, and other symbols that are less used in Western text.


It’s also that the tokenizer on the website straight up lies.
I have tested it myself, and it shows far more unknown tokens than is reasonable.

And the model can reason about content that appears only as a series of unknown tokens and retrieve answers that also show up as unknown tokens, which should be impossible.

The current GPT-3.5/GPT-4 token encoder on the OpenAI website was finally updated. If you choose GPT-3, you will get wrong counts - usually higher numbers because the older dictionary is half the size.
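If you want to compare the two dictionaries yourself, tiktoken ships both encodings; a minimal sketch with an arbitrary Hebrew sample:

```python
# Compare the older GPT-3 encoding (r50k_base) with the GPT-3.5/GPT-4
# encoding (cl100k_base); the newer vocabulary is roughly twice the size,
# so the same text usually needs fewer tokens.
import tiktoken

text = "דוגמה קצרה בעברית"  # "a short example in Hebrew"

for name in ["r50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```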

What are “unknown tokens” to you? Everything an AI receives must be tokenized, and every single character has a way of being encoded - even if that means encoding a three-byte Chinese Unicode character into three byte-value tokens.
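For example, tiktoken can show the raw bytes behind each token id. Whether a given character comes out as one merged token or several byte-value tokens depends on how common it was in the training data, so the sample characters below are just illustrative:

```python
# Every character can be encoded, if necessary as raw byte-value tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for ch in ["猫", "שׁ"]:  # a Chinese character and a Hebrew letter with a point
    ids = enc.encode(ch)
    print(f"{ch!r}: {len(ch.encode('utf-8'))} UTF-8 bytes -> {len(ids)} token(s)")
    for t in ids:
        print("  ", t, enc.decode_single_token_bytes(t))
```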

I must correct myself; I just looked into it.
So the website and tiktoken are fine (though they do show non-Unicode characters at times, but do not actually use unknown tokens).

Hugging Face has a wrong implementation of it, which is fairly annoying…
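If anyone wants to check this themselves, counts from tiktoken and a Hugging Face tokenizer can be compared directly. This sketch assumes the `transformers` package and uses the openly published “gpt2” vocabulary; whether the numbers diverge depends on the text and the tokenizer versions involved:

```python
import tiktoken
from transformers import GPT2TokenizerFast

text = "שלום עולם"  # "hello world" in Hebrew, arbitrary sample

tk_ids = tiktoken.get_encoding("gpt2").encode(text)
hf_ids = GPT2TokenizerFast.from_pretrained("gpt2").encode(text)

print("tiktoken gpt2:       ", len(tk_ids), tk_ids)
print("HF GPT2TokenizerFast:", len(hf_ids), hf_ids)
```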



I completely agree; they are overcharging without any control over it.