Token count for Hebrew responses seems much higher

When prompting the “gpt-3.5-turbo” model in Hebrew, I’m getting a much shorter answer for many more tokens.

For example, I was charged 450 tokens for an 88-word response, a ratio of ~0.20 words per token, while in English it seems more like ~0.7.

Can I do something different?

It all comes down to the tokenisation of the training data and how frequently various letter and symbol groups occur in it. The training dataset contains text from the internet and other text sources. If that dataset contains many occurrences of the word “orange”, then a token is created to represent that whole word; other times a word fragment is more common, so the fragment gets its own token. For example, “apple” is 1 token because its usage is common, but “rhubarb” is 3 tokens because its usage is more sparse.
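The frequency-driven merging described above is the core of byte-pair encoding (BPE). Here is a toy sketch of one merge step; the function names and the tiny corpus are illustrative only, not the actual gpt-3.5-turbo vocabulary or tokenizer:

```python
# Toy BPE merge step: the most frequent adjacent symbol pair across the
# corpus is fused into a single new symbol (i.e. it "gets its own token").
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite each word with every occurrence of `pair` fused into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# "apple" and "apt" are frequent, "rhubarb" is rare, so ('a', 'p') merges first.
corpus = {tuple("apple"): 10, tuple("apt"): 5, tuple("rhubarb"): 1}
pair = most_frequent_pair(corpus)
print(pair)  # ('a', 'p')
corpus = merge_pair(corpus, pair)
print(corpus[("ap", "p", "l", "e")])  # 10
```

Repeating this step thousands of times on a mostly-English corpus is why common English words end up as single tokens while rare words, and rare scripts, stay split into many pieces.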

Because Hebrew is sparser in the training dataset than English, Spanish, French, etc., it tends to be assigned more tokens per word, as those symbol groups occur less often in the data.
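There is also a second, purely mechanical factor: byte-level BPE starts from UTF-8 bytes, and Hebrew letters are 2 bytes each while ASCII letters are 1 byte. A quick stdlib check (the word "שלום" here is just an arbitrary example):

```python
# Byte-level tokenizers see UTF-8 bytes, not characters. Hebrew starts
# out with twice as many bytes per letter as English before any merges.
english = "apple"
hebrew = "שלום"  # 4 Hebrew letters

print(len(english), len(english.encode("utf-8")))  # 5 chars -> 5 bytes
print(len(hebrew), len(hebrew.encode("utf-8")))    # 4 chars -> 8 bytes
```

More starting bytes plus fewer learned merges for that script both push the tokens-per-word ratio up.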

As the model has already been trained, there is not much you can do: Hebrew will continue to cost more tokens to encode, especially if the words contain vowel points (niqqud) and other letter modifiers that are rarely used in the (largely Western) training data.