Hi _j,
Thank you for the detailed explanation and information.
I'm not asking for immediate or fast improvements, but I hope OpenAI understands that there is a lot of room for improvement from a Korean-language perspective.
I think the token buffer limit will eventually go away with larger context sizes, but the longer token sequences will remain a big problem, because Korean text costs more than the equivalent English at the same per-token price.
I hope OpenAI could tokenize Korean more like the following tokenizer (bab2min/kiwipiepy, hosted on GitHub).
For the same text you mentioned, it generates 47 tokens with POS tags.
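For reference, here is a minimal sketch of the call that produces the output below (it assumes the kiwipiepy package is installed and uses its Kiwi().tokenize() interface; the variable names are mine):

from kiwipiepy import Kiwi

kiwi = Kiwi()
# tokenize() returns Token objects carrying the surface form, POS tag, start offset, and length
tokens = kiwi.tokenize('《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다.')
print(tokens)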
[Token(form='《', tag='SSO', start=0, len=1),
Token(form='기생충', tag='NNG', start=1, len=3),
Token(form='》', tag='SSC', start=4, len=1),
Token(form='(', tag='SSO', start=5, len=1),
Token(form='寄生蟲', tag='SH', start=6, len=3),
Token(form=',', tag='SP', start=9, len=1),
Token(form='Parasite', tag='SL', start=11, len=8),
Token(form=')', tag='SSC', start=19, len=1),
Token(form='은', tag='JX', start=20, len=1),
Token(form='2019', tag='SN', start=22, len=4),
Token(form='년', tag='NNB', start=26, len=1),
Token(form='5', tag='SN', start=28, len=1),
Token(form='월', tag='NNB', start=29, len=1),
Token(form='30', tag='SN', start=31, len=2),
Token(form='일', tag='NNB', start=33, len=1),
Token(form='에', tag='JKB', start=34, len=1),
Token(form='개봉', tag='NNG', start=36, len=2),
Token(form='하', tag='XSV', start=38, len=1),
Token(form='ᆫ', tag='ETM', start=38, len=1),
Token(form='대한민국', tag='NNP', start=40, len=4),
Token(form='의', tag='JKG', start=44, len=1),
Token(form='블랙', tag='NNG', start=46, len=2),
Token(form='코미디', tag='NNG', start=49, len=3),
Token(form='서스펜스', tag='NNG', start=53, len=4),
Token(form='영화', tag='NNG', start=58, len=2),
Token(form='이', tag='VCP', start=60, len=1),
Token(form='다', tag='EF', start=61, len=1),
Token(form='.', tag='SF', start=62, len=1),
Token(form='봉준호', tag='NNP', start=64, len=3),
Token(form='의', tag='JKG', start=67, len=1),
Token(form='일곱', tag='NR', start=69, len=2),
Token(form='번', tag='NNB', start=72, len=1),
Token(form='째', tag='XSN', start=73, len=1),
Token(form='장편', tag='NNG', start=75, len=2),
Token(form='영화', tag='NNG', start=78, len=2),
Token(form='로', tag='JKB', start=80, len=1),
Token(form=',', tag='SP', start=81, len=1),
Token(form='한', tag='MM', start=83, len=1),
Token(form='진원', tag='NNG', start=84, len=2),
Token(form='과', tag='JC', start=86, len=1),
Token(form='공동', tag='NNG', start=88, len=2),
Token(form='각본', tag='NNG', start=91, len=2),
Token(form='을', tag='JKO', start=93, len=1),
Token(form='쓰', tag='VV', start=95, len=1),
Token(form='었', tag='EP', start=95, len=1),
Token(form='다', tag='EF', start=96, len=1),
Token(form='.', tag='SF', start=97, len=1)]
Here is how cl100k_base tokenizes the same text. It generates 103 tokens, more than twice as many as the morpheme-based tokenization above (103 / 47 ≈ 2.2×).

import tiktoken

enc = tiktoken.encoding_for_model('gpt-3.5-turbo')
enc.encode('《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, 한진원과 공동 각본을 썼다.')
[28038, 21121, 77535, 54596, 102, 26123, 7, 15973, 226, 21990, 164, 253, 110, 11, 94137, 635, 8, 34804, 220, 679, 24, 75265, 226, 220, 20, 38389, 242, 220, 966, 33177, 19954, 74623, 167, 112, 231, 24486, 62060, 24486, 50273, 120, 89059, 255, 21028, 5251, 116, 242, 39519, 247, 3396, 66391, 57139, 90335, 90960, 25941, 169, 236, 250, 25941, 39623, 223, 57390, 13094, 13447, 13, 5251, 112, 231, 59269, 222, 48424, 21028, 84656, 22783, 109, 85721, 84766, 16633, 98, 169, 236, 116, 39623, 223, 57390, 17835, 11, 62398, 86351, 55421, 54780, 46230, 113, 58189, 17196, 223, 29099, 116, 18359, 3396, 235, 120, 13447, 13]
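For a quick side-by-side count, a short sketch like the following should reproduce the comparison (it assumes both the tiktoken and kiwipiepy packages are installed; the variable names are mine):

import tiktoken
from kiwipiepy import Kiwi

text = ('《기생충》(寄生蟲, Parasite)은 2019년 5월 30일에 개봉한 대한민국의 '
        '블랙 코미디 서스펜스 영화이다. 봉준호의 일곱 번째 장편 영화로, '
        '한진원과 공동 각본을 썼다.')

enc = tiktoken.get_encoding('cl100k_base')
print(len(enc.encode(text)))       # BPE token count (103 per the comparison above)
print(len(Kiwi().tokenize(text)))  # morpheme token count (47 per the output above)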
Thanks!
Yong Hee Park