Token size in Russian lang

msveshnikov · January 6, 2023, 10:07am

Hello, I found that for Russian text each letter is a token. At least, that is what I infer from my bill. Is this correct? Not 1 token ≈ 0.75 words, as in English, but 1 token == 1 Russian letter?

PaulBellow · January 6, 2023, 5:36pm

Here’s a tokenizer where you can test…

Welcome to the community!

msveshnikov · January 6, 2023, 6:02pm

Yes, thank you! Tested: 1 Russian letter is indeed 1 token. What discrimination!
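To put the two rules of thumb from this thread side by side, here is a rough sketch. It is not a real tokenizer: it only applies the "1 token ≈ 0.75 words" heuristic for English and the "1 token per letter" observation for Russian reported above. The function names are made up for illustration, and spaces and punctuation are ignored in the Russian estimate.

```python
import math
import re

# Rules of thumb quoted in this thread, NOT a real tokenizer:
#   English: roughly 0.75 words per token
#   Russian: roughly 1 Cyrillic letter per token (as observed in January 2023)
ENGLISH_WORDS_PER_TOKEN = 0.75


def estimate_english_tokens(text: str) -> int:
    """Estimate tokens as (whitespace-separated word count) / 0.75, rounded up."""
    words = len(text.split())
    return math.ceil(words / ENGLISH_WORDS_PER_TOKEN)


def estimate_russian_tokens(text: str) -> int:
    """Estimate tokens as one per Cyrillic letter (spaces and punctuation ignored)."""
    return len(re.findall(r"[а-яёА-ЯЁ]", text))


english = "Hello, how are you today?"
russian = "Привет, как дела сегодня?"

print(estimate_english_tokens(english))  # 5 words / 0.75 -> 7
print(estimate_russian_tokens(russian))  # 20 letters -> 20
```

Under these assumptions the same short greeting costs about three times as many tokens in Russian as in English. For real counts, use the tokenizer linked above rather than this estimate.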

mike_orlov · July 18, 2024, 2:26pm

Indeed, it is. The reason is language features such as the many prefixes and endings, word declensions, and so on. But it looks like the ratio has improved since the original post.

Fun fact: every character would be a token in Korean or Chinese.
