Explosion in the number of tokens / words generated

Yes, they could. But that would be a huge percentage of tokens dedicated to a fairly low-resource language. I think Chinese has on the order of 2,000 tokens dedicated to it.
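To make the token-count explosion concrete, here is a toy sketch (not any real tokenizer's algorithm): a hypothetical vocabulary covers common English words, and anything outside it falls back to raw UTF-8 bytes, one token per byte. This loosely mimics why text in a script with few dedicated vocabulary entries gets split into many more tokens.

```python
# Toy byte-fallback tokenizer (illustrative only, not a real BPE):
# known words cost one token; unknown text falls back to its UTF-8
# bytes, one token per byte.
VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def count_tokens(text: str) -> int:
    total = 0
    for word in text.split():
        if word.lower() in VOCAB:
            total += 1  # one token for a vocabulary word
        else:
            total += len(word.encode("utf-8"))  # byte-level fallback
    return total

english = "the quick brown fox"
chinese = "敏捷的棕色狐狸"  # roughly the same phrase in Chinese

print(count_tokens(english))  # 4 tokens: every word is in the vocab
print(count_tokens(chinese))  # 21 tokens: 7 chars x 3 UTF-8 bytes each
```

The same phrase costs roughly five times as many tokens in the script the vocabulary doesn't cover, which is the trade-off being discussed: adding thousands of dedicated tokens for a language shrinks its sequences, but spends vocabulary slots.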

@jochenschultz @Ailogik

Here’s a detailed analysis of the tokenizer by Yennie Jun: