Token count is very high for Vietnamese

Hi Team,

We are planning to build a fine-tuned model that supports Vietnamese, but the tokenizer used by the OpenAI API produces a very high token count for Vietnamese words.

“One” = 1 token [3198] but “Một” = 5 tokens [337, 157, 119, 247, 83]

Is there any way we can optimize the token count for other languages?
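The gap comes from how byte-level BPE tokenizers handle text that was rare in the training data: a common English word like "One" has been merged into a single token, while a Vietnamese word falls back toward one token per UTF-8 byte, and accented characters like "ộ" take 3 bytes each. Here is a minimal sketch (not the actual OpenAI tokenizer) showing that "Một" is 5 UTF-8 bytes, which matches the five token IDs above:

```python
# Illustrative sketch: byte-level BPE worst case is one token per UTF-8 byte.
# "One" is 3 bytes but a frequent English word, so BPE merges it into 1 token;
# "Một" is 5 bytes ('ộ' = U+1ED9 alone encodes to 3 bytes), and with few
# Vietnamese merges learned, each byte can remain its own token.

def utf8_byte_count(text: str) -> int:
    """UTF-8 byte length: the upper bound on a byte-level BPE token count."""
    return len(text.encode("utf-8"))

print(utf8_byte_count("One"))  # -> 3 bytes (merged to 1 token by BPE)
print(utf8_byte_count("Một"))  # -> 5 bytes (no merges learned -> 5 tokens)
```

If you want to check real counts locally, the `tiktoken` library (e.g. `tiktoken.get_encoding("gpt2").encode("Một")`) reproduces the encodings the API uses; newer vocabularies such as `cl100k_base` are somewhat more efficient on non-English text.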

Many Thanks



Hi, I am having the same issue. Have you found a solution yet? I am just starting my project, and the projected cost already seems very high. If you have found anything, could you help me out? Thanks