Token count is very high for Vietnamese

Hi Team,

We are planning to build a fine-tuned model that supports Vietnamese, but the tokenizer used by the OpenAI API produces a very high token count for Vietnamese words.

“One” = 1 token [3198] but “Một” = 5 tokens [337, 157, 119, 247, 83]

Is there any way we can optimize the token count for other languages?
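The gap comes from how byte-level BPE tokenizers handle text that was rare in the training data: a common English word like "One" has been merged into a single token, while a Vietnamese word falls back toward one token per UTF-8 byte, and accented characters like "ộ" take 3 bytes each. Here is a minimal sketch (not the actual OpenAI tokenizer) showing that "Một" is 5 UTF-8 bytes, which matches the five token IDs above:

```python
# Illustrative sketch: byte-level BPE worst case is one token per UTF-8 byte.
# "One" is 3 bytes but a frequent English word, so BPE merges it into 1 token;
# "Một" is 5 bytes ('ộ' = U+1ED9 alone encodes to 3 bytes), and with few
# Vietnamese merges learned, each byte can remain its own token.

def utf8_byte_count(text: str) -> int:
    """UTF-8 byte length: the upper bound on a byte-level BPE token count."""
    return len(text.encode("utf-8"))

print(utf8_byte_count("One"))  # -> 3 bytes (merged to 1 token by BPE)
print(utf8_byte_count("Một"))  # -> 5 bytes (no merges learned -> 5 tokens)
```

If you want to check real counts locally, the `tiktoken` library (e.g. `tiktoken.get_encoding("gpt2").encode("Một")`) reproduces the encodings the API uses; newer vocabularies such as `cl100k_base` are somewhat more efficient on non-English text.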

Many Thanks



Hi, I am having the same issue. Have you found a solution yet? I am just starting my project, and the projected cost already seems very high. If you have found anything, could you help me out? Thanks