Hi Team,
We are planning to build a fine-tuned model that supports Vietnamese, but the token count from the OpenAI API tokenizer for Vietnamese words is very high.
For example, “One” = 1 token [3198], but “Một” = 5 tokens [337, 157, 119, 247, 83].
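For reference, here is a minimal sketch of how we reproduce these counts locally with the open-source tiktoken library; the encoding name "p50k_base" is an assumption about which tokenizer our target model uses, and other models use different encodings:

```python
# Minimal sketch: compare per-word token counts with tiktoken.
# "p50k_base" is an assumption; swap in the encoding for your actual model,
# e.g. tiktoken.encoding_for_model("text-davinci-003").
import tiktoken

enc = tiktoken.get_encoding("p50k_base")

for word in ["One", "Một"]:
    ids = enc.encode(word)
    print(f"{word!r} -> {len(ids)} tokens: {ids}")
```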
Is there any way we can reduce the token count for non-English languages like Vietnamese?
Many thanks,
Ky