Need more efficient tokenizer for Korean

Hi everyone,

I have serious concerns about the poor tokenization efficiency of OpenAI models, especially for Korean, which results in 3-5 times as many tokens as English (and more even than Japanese). According to LLM AI Tokens, OpenAI's LLMs use BPE as their tokenization method, which produces a large number of tokens for agglutinative languages like Korean.

Do you have any plans to make it more efficient so that we can reduce the number of tokens? This is becoming more and more important for customers in Korea. (I have heard many similar concerns from people using OpenAI GPT models.)

Thank you!

Yong Hee Park


You need to look at the 100k tokenizer (cl100k_base) used by GPT-3.5 and GPT-4. It has a very deep alphabet, generally one token per Hangul character, which compares favorably to Chinese or Japanese.

(one color per token):


Consider that:

  1. There are over 11,000 precomposed Korean syllable characters in Unicode (11,172 in the Hangul Syllables block). That is already a large share dedicated to just one of the world's alphabets.
  2. Korean has a per-word token usage similar to that of long words in Latin-script languages. Character-level tokens also give stronger semantic associations between words, because pronunciation is less ambiguous and similar-sounding syllables are grouped together; that would be lost if there were just a Korean dictionary. A Korean dictionary, especially one covering only a fragment of the language's vocabulary, would cause the same problems that tokenization causes in English: the AI can't count letters or handle regular expressions, because it can't see individual English letters.
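
To make point 1 concrete, here is a minimal Python sketch (standard library only) that counts the precomposed Hangul syllables and shows how one syllable decomposes into jamo. The helper name `hangul_syllable_count` is made up for illustration, and the UTF-8 length is printed on the assumption that the tokenizer starts from UTF-8 bytes.

```python
import unicodedata

# The Hangul Syllables block covers U+AC00..U+D7A3.
FIRST, LAST = 0xAC00, 0xD7A3


def hangul_syllable_count() -> int:
    """Number of precomposed Hangul syllables (hypothetical helper name)."""
    return LAST - FIRST + 1


# Each syllable combines one of 19 initial consonants, one of 21 vowels,
# and one of 28 final positions (27 final consonants plus "no final").
assert hangul_syllable_count() == 19 * 21 * 28

syllable = "한"
jamo = unicodedata.normalize("NFD", syllable)  # canonical decomposition into jamo
utf8_len = len(syllable.encode("utf-8"))       # bytes a byte-level BPE would start from

print(hangul_syllable_count())        # 11172
print([hex(ord(c)) for c in jamo])    # ['0x1112', '0x1161', '0x11ab']
print(utf8_len)                       # 3
```

Giving every one of those syllables its own token would take roughly a tenth of a 100k vocabulary, which is the trade-off point 1 is describing.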

Hi _j,

Thank you for the detailed explanation and information.

I'm not asking for immediate or fast improvements. But I hope OpenAI understands that there is much room for improvement from the Korean language perspective.

I think the token buffer limit will stop being an issue after some changes (a larger token buffer size). But the disadvantage of long token sequences will remain a big problem, because the cost per text is higher than for English.

I hope OpenAI could tokenize like the following tokenizer (bab2min/kiwipiepy, hosted on GitHub).
For the same text you mentioned, it generates 47 tokens, each with a POS tag.

[Token(form='《', tag='SSO', start=0, len=1),
Token(form='기생충', tag='NNG', start=1, len=3),
Token(form='》', tag='SSC', start=4, len=1),
Token(form='(', tag='SSO', start=5, len=1),
Token(form='寄生蟲', tag='SH', start=6, len=3),
Token(form=',', tag='SP', start=9, len=1),
Token(form='Parasite', tag='SL', start=11, len=8),
Token(form=')', tag='SSC', start=19, len=1),
Token(form='은', tag='JX', start=20, len=1),
Token(form='2019', tag='SN', start=22, len=4),
Token(form='년', tag='NNB', start=26, len=1),
Token(form='5', tag='SN', start=28, len=1),
Token(form='월', tag='NNB', start=29, len=1),
Token(form='30', tag='SN', start=31, len=2),
Token(form='일', tag='NNB', start=33, len=1),
Token(form='에', tag='JKB', start=34, len=1),
Token(form='개봉', tag='NNG', start=36, len=2),
Token(form='하', tag='XSV', start=38, len=1),
Token(form='ᆫ', tag='ETM', start=38, len=1),
Token(form='대한민국', tag='NNP', start=40, len=4),
Token(form='의', tag='JKG', start=44, len=1),
Token(form='블랙', tag='NNG', start=46, len=2),
Token(form='코미디', tag='NNG', start=49, len=3),
Token(form='서스펜스', tag='NNG', start=53, len=4),
Token(form='영화', tag='NNG', start=58, len=2),
Token(form='이', tag='VCP', start=60, len=1),
Token(form='다', tag='EF', start=61, len=1),
Token(form='.', tag='SF', start=62, len=1),
Token(form='봉준호', tag='NNP', start=64, len=3),
Token(form='의', tag='JKG', start=67, len=1),
Token(form='일곱', tag='NR', start=69, len=2),
Token(form='번', tag='NNB', start=72, len=1),
Token(form='째', tag='XSN', start=73, len=1),
Token(form='장편', tag='NNG', start=75, len=2),
Token(form='영화', tag='NNG', start=78, len=2),
Token(form='로', tag='JKB', start=80, len=1),
Token(form=',', tag='SP', start=81, len=1),
Token(form='한', tag='MM', start=83, len=1),
Token(form='진원', tag='NNG', start=84, len=2),
Token(form='과', tag='JC', start=86, len=1),
Token(form='공동', tag='NNG', start=88, len=2),
Token(form='각본', tag='NNG', start=91, len=2),
Token(form='을', tag='JKO', start=93, len=1),
Token(form='쓰', tag='VV', start=95, len=1),
Token(form='었', tag='EP', start=95, len=1),
Token(form='다', tag='EF', start=96, len=1),
Token(form='.', tag='SF', start=97, len=1)]

Here is how cl100k_base tokenizes the same text. It generates 103 tokens, more than twice as many as the previous one.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
enc.encode("《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다.")
[28038, 21121, 77535, 54596, 102, 26123, 7, 15973, 226, 21990, 164, 253, 110, 11, 94137, 635, 8, 34804, 220, 679, 24, 75265, 226, 220, 20, 38389, 242, 220, 966, 33177, 19954, 74623, 167, 112, 231, 24486, 62060, 24486, 50273, 120, 89059, 255, 21028, 5251, 116, 242, 39519, 247, 3396, 66391, 57139, 90335, 90960, 25941, 169, 236, 250, 25941, 39623, 223, 57390, 13094, 13447, 13, 5251, 112, 231, 59269, 222, 48424, 21028, 84656, 22783, 109, 85721, 84766, 16633, 98, 169, 236, 116, 39623, 223, 57390, 17835, 11, 62398, 86351, 55421, 54780, 46230, 113, 58189, 17196, 223, 29099, 116, 18359, 3396, 235, 120, 13447, 13]
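
As a quick sanity check of the two counts, the sketch below takes the cl100k_base token IDs exactly as pasted above and compares their number with the 47 morphemes reported by kiwipiepy. It uses only the standard library. The `relative_cost` helper and its price unit are placeholders for illustration, not actual OpenAI pricing.

```python
# Token IDs copied from the cl100k_base output above.
cl100k_ids = [
    28038, 21121, 77535, 54596, 102, 26123, 7, 15973, 226, 21990,
    164, 253, 110, 11, 94137, 635, 8, 34804, 220, 679,
    24, 75265, 226, 220, 20, 38389, 242, 220, 966, 33177,
    19954, 74623, 167, 112, 231, 24486, 62060, 24486, 50273, 120,
    89059, 255, 21028, 5251, 116, 242, 39519, 247, 3396, 66391,
    57139, 90335, 90960, 25941, 169, 236, 250, 25941, 39623, 223,
    57390, 13094, 13447, 13, 5251, 112, 231, 59269, 222, 48424,
    21028, 84656, 22783, 109, 85721, 84766, 16633, 98, 169, 236,
    116, 39623, 223, 57390, 17835, 11, 62398, 86351, 55421, 54780,
    46230, 113, 58189, 17196, 223, 29099, 116, 18359, 3396, 235,
    120, 13447, 13,
]

KIWI_TOKENS = 47  # morpheme count reported by kiwipiepy above


def relative_cost(n_tokens: int, price_per_1k: float = 1.0) -> float:
    """Cost in arbitrary units; price_per_1k is a hypothetical placeholder."""
    return n_tokens / 1000 * price_per_1k


ratio = len(cl100k_ids) / KIWI_TOKENS
print(len(cl100k_ids), round(ratio, 2))  # 103 2.19

# With per-token billing, the cost ratio equals the token-count ratio.
cost_ratio = relative_cost(len(cl100k_ids)) / relative_cost(KIWI_TOKENS)
print(round(cost_ratio, 2))              # 2.19
```

Note that the two counts are not like for like: kiwipiepy returns linguistic morphemes from a dedicated Korean analyzer, while cl100k_base returns subword units from a general-purpose vocabulary shared across all languages.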


Yong Hee Park
