Token size in the Russian language

Hello, I found that for Russian text each letter is a token. At least, that is what I infer from my bill. Is that correct? Not 0.75 words == 1 token, like in English, but 1 Russian letter == 1 token?

Here’s a tokenizer where you can test…

Welcome to the community!

Yes, thank you! Tested, and 1 Russian letter is indeed 1 token. What discrimination :slight_smile:
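
For anyone who wants to see what that difference means for a bill, here is a rough sketch that only applies the two rules of thumb mentioned in this thread (about 0.75 English words per token, and one token per Russian letter). It is not a real tokenizer: the function names and sample sentences are made up for illustration, and spaces and punctuation are ignored in the per-letter count, so real numbers will differ.

```python
def estimate_tokens_english(text: str) -> int:
    """Rule of thumb from the thread: 1 token is roughly 0.75 English words."""
    words = len(text.split())
    return round(words / 0.75)


def estimate_tokens_per_letter(text: str) -> int:
    """Worst case reported in the thread: every letter is its own token.

    Spaces and punctuation are not counted (a simplifying assumption).
    """
    return sum(1 for ch in text if ch.isalpha())


# Hypothetical sample sentences with roughly the same meaning.
english = "Hello, how are you today?"
russian = "Привет, как у тебя дела сегодня?"

english_tokens = estimate_tokens_english(english)
russian_tokens = estimate_tokens_per_letter(russian)

print("English estimate:", english_tokens)
print("Russian estimate:", russian_tokens)
print("Ratio:", round(russian_tokens / english_tokens, 1))
```

To get actual counts, paste the same sentences into the tokenizer linked above and compare them with these estimates.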

Indeed, it is. The reason is language features such as the many prefixes and endings, word declensions, etc. But it looks like the ratio has improved since the original post.

Fun fact: every character would be a token in Korean or Chinese.