Hi _j,
Thank you for the detailed explanation and information.
I'm not asking for immediate or fast improvements, but I hope OpenAI understands that there's much room for improvement from a Korean-language perspective.
I think the token buffer limit will eventually go away with future changes (a larger buffer size), but the long token sequences will remain a big problem, because they make Korean text much more expensive per character than English.
I hope OpenAI could tokenize Korean the way the following tokenizer does (bab2min/kiwipiepy, hosted on GitHub).
For the same text you mentioned, it generates 47 tokens, each with a POS tag:
[Token(form='《', tag='SSO', start=0, len=1),
Token(form='기생충', tag='NNG', start=1, len=3),
Token(form='》', tag='SSC', start=4, len=1),
Token(form='(', tag='SSO', start=5, len=1),
Token(form='寄生蟲', tag='SH', start=6, len=3),
Token(form=',', tag='SP', start=9, len=1),
Token(form='Parasite', tag='SL', start=11, len=8),
Token(form=')', tag='SSC', start=19, len=1),
Token(form='은', tag='JX', start=20, len=1),
Token(form='2019', tag='SN', start=22, len=4),
Token(form='년', tag='NNB', start=26, len=1),
Token(form='5', tag='SN', start=28, len=1),
Token(form='월', tag='NNB', start=29, len=1),
Token(form='30', tag='SN', start=31, len=2),
Token(form='일', tag='NNB', start=33, len=1),
Token(form='에', tag='JKB', start=34, len=1),
Token(form='개봉', tag='NNG', start=36, len=2),
Token(form='하', tag='XSV', start=38, len=1),
Token(form='ᆫ', tag='ETM', start=38, len=1),
Token(form='대한민국', tag='NNP', start=40, len=4),
Token(form='의', tag='JKG', start=44, len=1),
Token(form='블랙', tag='NNG', start=46, len=2),
Token(form='코미디', tag='NNG', start=49, len=3),
Token(form='서스펜스', tag='NNG', start=53, len=4),
Token(form='영화', tag='NNG', start=58, len=2),
Token(form='이', tag='VCP', start=60, len=1),
Token(form='다', tag='EF', start=61, len=1),
Token(form='.', tag='SF', start=62, len=1),
Token(form='봉준호', tag='NNP', start=64, len=3),
Token(form='의', tag='JKG', start=67, len=1),
Token(form='일곱', tag='NR', start=69, len=2),
Token(form='번', tag='NNB', start=72, len=1),
Token(form='째', tag='XSN', start=73, len=1),
Token(form='장편', tag='NNG', start=75, len=2),
Token(form='영화', tag='NNG', start=78, len=2),
Token(form='로', tag='JKB', start=80, len=1),
Token(form=',', tag='SP', start=81, len=1),
Token(form='한', tag='MM', start=83, len=1),
Token(form='진원', tag='NNG', start=84, len=2),
Token(form='과', tag='JC', start=86, len=1),
Token(form='공동', tag='NNG', start=88, len=2),
Token(form='각본', tag='NNG', start=91, len=2),
Token(form='을', tag='JKO', start=93, len=1),
Token(form='쓰', tag='VV', start=95, len=1),
Token(form='었', tag='EP', start=95, len=1),
Token(form='다', tag='EF', start=96, len=1),
Token(form='.', tag='SF', start=97, len=1)]
Here's how cl100k_base tokenizes the same text. It generates 103 tokens, more than twice as many as the tokenizer above.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
enc.encode("《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다.")
[28038, 21121, 77535, 54596, 102, 26123, 7, 15973, 226, 21990, 164, 253, 110, 11, 94137, 635, 8, 34804, 220, 679, 24, 75265, 226, 220, 20, 38389, 242, 220, 966, 33177, 19954, 74623, 167, 112, 231, 24486, 62060, 24486, 50273, 120, 89059, 255, 21028, 5251, 116, 242, 39519, 247, 3396, 66391, 57139, 90335, 90960, 25941, 169, 236, 250, 25941, 39623, 223, 57390, 13094, 13447, 13, 5251, 112, 231, 59269, 222, 48424, 21028, 84656, 22783, 109, 85721, 84766, 16633, 98, 169, 236, 116, 39623, 223, 57390, 17835, 11, 62398, 86351, 55421, 54780, 46230, 113, 58189, 17196, 223, 29099, 116, 18359, 3396, 235, 120, 13447, 13]
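As a rough illustration of why the counts differ so much (this is only a sketch, not cl100k_base's actual merge logic): cl100k_base is a byte-level BPE, so it operates on UTF-8 bytes, and every Hangul syllable occupies 3 bytes, while an ASCII letter occupies 1. Korean text therefore starts from roughly three times as many base units as comparable English before any merges are learned:

```python
# Each Hangul syllable is 3 bytes in UTF-8; each ASCII letter is 1 byte.
korean = "기생충"   # 3 syllables
english = "abc"     # 3 letters

print(len(korean), len(korean.encode("utf-8")))    # 3 characters, 9 bytes
print(len(english), len(english.encode("utf-8")))  # 3 characters, 3 bytes

# Token counts from the two tokenizers above (as reported in this post):
kiwi_tokens = 47     # morpheme-level tokens with POS tags
cl100k_tokens = 103  # byte-level BPE tokens
print(f"cl100k_base uses {cl100k_tokens / kiwi_tokens:.2f}x more tokens")
```

Since API pricing is per token, that ratio translates directly into Korean users paying over twice as much for the same sentence.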
Thanks!
Yong Hee Park