Need more efficient tokenizer for Korean

Hi everyone,

I have big concerns about the poor tokenization efficiency of OpenAI models for Korean, which results in 3-5 times as many tokens as English (even compared to Japanese). According to LLM AI Tokens, OpenAI's LLM models use BPE as the tokenization method, which produces a large number of tokens for agglutinative languages like Korean.

Do you have any future plans to make it more efficient so that we can reduce the number of tokens? It is getting more and more important for customers in Korea. (I have heard a lot of similar concerns from people using OpenAI GPT models.)

Thank you !

Yong Hee Park


You need to look at the 100k tokenizer of GPT-3.5 and 4. It has a very deep alphabet, generally one token per Hangul character, which compares favorably to Chinese or Japanese.

(one color per token):

[image "hangul-1": the sample Korean sentence with each token highlighted in a different color]

Consider that:

  1. there are over 11,000 precomposed Korean syllable characters in Unicode, so a large share of the token vocabulary is already dedicated to just one writing system.
  2. Korean has per-word token usage similar to that of long words in Latin-script languages, and character-level tokens keep stronger semantic associations between words, since similar-sounding phonetics are grouped and pronunciation is less ambiguous; that would be lost if there were just a word-level Korean dictionary. Such a dictionary, especially one covering only a fragment of the language's vocabulary, would cause the same problems that tokenization already causes in English: the AI can't count letters or work with regular expressions, because it can't see the individual letters.
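The per-character claim is easy to check yourself. Below is a minimal sketch (assuming the tiktoken package is installed; exact counts depend on the vocabulary and may differ for rarer syllables) that compares character counts with cl100k_base token counts for a few Hangul words:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# compare character count with token count for a few Hangul words
for word in ["대한민국", "기생충", "영화"]:
    tokens = enc.encode(word)
    print(word, len(word), "chars ->", len(tokens), "tokens")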

Hi _j,

Thank you for the detailed explanation and information.

I'm not asking for immediate or quick improvements, but I hope OpenAI understands that there is much room for improvement from the Korean-language perspective.

I think the context-window limit will matter less after some changes (a larger token buffer size), but the longer token sequences will still be a big problem because of the higher token cost compared to English.

I hope that OpenAI could tokenize like the following tokenizer (bab2min/kiwipiepy, hosted on GitHub).
For the same text as you mentioned, it generates 47 tokens, each with a POS tag.

[Token(form='《', tag='SSO', start=0, len=1),
Token(form='기생충', tag='NNG', start=1, len=3),
Token(form='》', tag='SSC', start=4, len=1),
Token(form='(', tag='SSO', start=5, len=1),
Token(form='寄生蟲', tag='SH', start=6, len=3),
Token(form=',', tag='SP', start=9, len=1),
Token(form='Parasite', tag='SL', start=11, len=8),
Token(form=')', tag='SSC', start=19, len=1),
Token(form='은', tag='JX', start=20, len=1),
Token(form='2019', tag='SN', start=22, len=4),
Token(form='년', tag='NNB', start=26, len=1),
Token(form='5', tag='SN', start=28, len=1),
Token(form='월', tag='NNB', start=29, len=1),
Token(form='30', tag='SN', start=31, len=2),
Token(form='일', tag='NNB', start=33, len=1),
Token(form='에', tag='JKB', start=34, len=1),
Token(form='개봉', tag='NNG', start=36, len=2),
Token(form='하', tag='XSV', start=38, len=1),
Token(form='ᆫ', tag='ETM', start=38, len=1),
Token(form='대한민국', tag='NNP', start=40, len=4),
Token(form='의', tag='JKG', start=44, len=1),
Token(form='블랙', tag='NNG', start=46, len=2),
Token(form='코미디', tag='NNG', start=49, len=3),
Token(form='서스펜스', tag='NNG', start=53, len=4),
Token(form='영화', tag='NNG', start=58, len=2),
Token(form='이', tag='VCP', start=60, len=1),
Token(form='다', tag='EF', start=61, len=1),
Token(form='.', tag='SF', start=62, len=1),
Token(form='봉준호', tag='NNP', start=64, len=3),
Token(form='의', tag='JKG', start=67, len=1),
Token(form='일곱', tag='NR', start=69, len=2),
Token(form='번', tag='NNB', start=72, len=1),
Token(form='째', tag='XSN', start=73, len=1),
Token(form='장편', tag='NNG', start=75, len=2),
Token(form='영화', tag='NNG', start=78, len=2),
Token(form='로', tag='JKB', start=80, len=1),
Token(form=',', tag='SP', start=81, len=1),
Token(form='한', tag='MM', start=83, len=1),
Token(form='진원', tag='NNG', start=84, len=2),
Token(form='과', tag='JC', start=86, len=1),
Token(form='공동', tag='NNG', start=88, len=2),
Token(form='각본', tag='NNG', start=91, len=2),
Token(form='을', tag='JKO', start=93, len=1),
Token(form='쓰', tag='VV', start=95, len=1),
Token(form='었', tag='EP', start=95, len=1),
Token(form='다', tag='EF', start=96, len=1),
Token(form='.', tag='SF', start=97, len=1)]
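For reference, the list above can be reproduced roughly as follows. This is only a sketch (it assumes kiwipiepy is installed; the analysis may vary slightly between kiwipiepy versions and models):

from kiwipiepy import Kiwi

kiwi = Kiwi()
text = "《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다."
tokens = kiwi.tokenize(text)   # morpheme-level tokens with POS tags
print(len(tokens))             # 47 for this sentence
for t in tokens:
    print(t.form, t.tag, t.start, t.len)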

Here is how cl100k_base tokenizes the same text. It generates 103 tokens, more than twice as many as the previous one.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
enc.encode("《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다.")
[28038, 21121, 77535, 54596, 102, 26123, 7, 15973, 226, 21990, 164, 253, 110, 11, 94137, 635, 8, 34804, 220, 679, 24, 75265, 226, 220, 20, 38389, 242, 220, 966, 33177, 19954, 74623, 167, 112, 231, 24486, 62060, 24486, 50273, 120, 89059, 255, 21028, 5251, 116, 242, 39519, 247, 3396, 66391, 57139, 90335, 90960, 25941, 169, 236, 250, 25941, 39623, 223, 57390, 13094, 13447, 13, 5251, 112, 231, 59269, 222, 48424, 21028, 84656, 22783, 109, 85721, 84766, 16633, 98, 169, 236, 116, 39623, 223, 57390, 17835, 11, 62398, 86351, 55421, 54780, 46230, 113, 58189, 17196, 223, 29099, 116, 18359, 3396, 235, 120, 13447, 13]
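Wrapping the same call in len() gives the count directly for the comparison. A minimal sketch, reusing enc from the snippet above:

sentence = "《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다."
print(len(enc.encode(sentence)))   # 103 tokens with cl100k_base vs. 47 Kiwi morphemes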

Thanks!

Yong Hee Park
