Hello, I found that for Russian text each letter is a token. At least, that is what I infer from my bill. Is that correct? Not 1 token ≈ 0.75 words, like in English, but 1 Russian letter == 1 token?
Here’s a tokenizer where you can test…
Welcome to the community!
Yes, thank you! Tested: 1 Russian letter is indeed 1 token. What discrimination!
Indeed, it is. The reason is language features such as the many prefixes and endings, word declensions, etc. But it looks like the ratio has improved since the original post.
Fun fact: every character would be a token in Korean or Chinese.
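If you want to check the tokens-per-letter ratio yourself instead of reading it off a bill, here is a minimal sketch. The names `tokens_per_char` and `utf8_bytes` are made up for illustration, and `utf8_bytes` is only a stand-in, not a real tokenizer: it counts raw UTF-8 bytes, which is roughly the worst case for a byte-level tokenizer that has learned no merges for a script. To measure the real thing, pass the encode function of whatever tokenizer you are billed for.

```python
from typing import Callable, Sequence


def tokens_per_char(text: str, encode: Callable[[str], Sequence]) -> float:
    """Average number of tokens the given encoder spends per character."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


def utf8_bytes(text: str) -> bytes:
    # Stand-in encoder (an assumption, NOT a real tokenizer):
    # one "token" per UTF-8 byte.
    return text.encode("utf-8")


# Sample strings chosen only for illustration.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
}

for name, sample in samples.items():
    ratio = tokens_per_char(sample, utf8_bytes)
    print(f"{name}: {ratio:.1f} bytes per character")
```

With the byte stand-in, Latin letters cost 1 byte each, Cyrillic letters 2, and common Chinese characters 3. A real tokenizer usually merges frequent byte sequences into single tokens, so the actual ratio depends on how much text in each language it was trained on.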