Hello, I found that for Russian text each letter is a token. At least, that is what I infer from my bill. Is that correct? Not 1 token ≈ 0.75 words, as in English, but 1 Russian letter == 1 token?
Here’s a tokenizer where you can test…
Welcome to the community!
Yes, thank you! Tested: 1 Russian letter is indeed 1 token. What discrimination!
Indeed, it is. The reason is language features such as the many prefixes and endings, word declensions, etc. But it looks like the ratio has improved since the original post.
Fun fact: in Chinese or Korean, every character would be a token.
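One thing that may contribute to the numbers in this thread, assuming the tokenizer is a byte-level BPE (which OpenAI's GPT tokenizers are): it starts from UTF-8 bytes, and non-Latin scripts need more bytes per character before any merges are applied. Here is a minimal sketch that only counts bytes and does not call any tokenizer. The helper name `utf8_bytes_per_char` and the sample strings are made up for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample strings (not taken from the thread).
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
    "Korean": "안녕",
}

for name, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {n_chars} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes/char")
```

Cyrillic letters take 2 bytes each in UTF-8, and common Chinese characters and precomposed Hangul syllables take 3, so a tokenizer with few learned merges for those scripts ends up with at least one token per character. The byte count is only a rough upper bound on the token count; for the real number, use the tokenizer linked above.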