Yes, they could. But that would be a huge percentage of tokens dedicated to a fairly low-resource language. I think Chinese has on the order of 2,000 tokens dedicated to it.
Here’s a detailed analysis of the tokenizer by Yennie Jun,