Inquiry Regarding Token Counting in Japanese for GPT-3 API

I hope this message finds you well. I am writing to seek clarification regarding the token counting mechanism in the GPT-3 API, specifically in the context of the Japanese language.

As I understand from your documentation, you charge based on tokens, with pricing quoted per 1,000 tokens. However, I would appreciate further information on how tokens are counted when processing Japanese text.

For example, take my name, “梅沢良成.” The tokenizer counts it as 11 tokens. If tokens were counted by character length, it would be 4 tokens; if counted by phonetics (pronunciation), it would be 8 tokens (うめざわよしなり). I would like to understand more clearly how tokens are counted for Japanese text.

Furthermore, I would like to suggest the option of counting tokens in Japanese text based on character length (i.e., counting each character as one token) for ease of understanding and consistency with the language structure.

Could you please provide me with more information on how tokens are counted for Japanese text, and consider the possibility of introducing a character-based token counting option for Japanese language users?

I appreciate your prompt response and assistance in addressing these questions and suggestions. Thank you for your attention to this matter.

Token counts can be inspected with an online tokenizer tool. Here is one example: https://tiktokenizer.vercel.app/

For estimating token usage in your own software, you can use a library such as tiktoken (GitHub - openai/tiktoken: a fast BPE tokeniser for use with OpenAI's models).
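If you want a programmatic estimate rather than the web tool, a minimal sketch along these lines counts tokens for Japanese strings (the encoding name, sample text, and price figure below are illustrative assumptions, not official values):

```python
# Minimal sketch: counting tokens for Japanese text with tiktoken.
# The encoding name and price figure are assumptions for illustration;
# check the encoding your model actually uses and the current pricing page.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # or tiktoken.encoding_for_model("gpt-3.5-turbo")

for text in ["梅沢良成", "うめざわよしなり", "Umezawa Yoshinari"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(text)} characters -> {len(tokens)} tokens")

# Pricing is quoted per 1,000 tokens, so a rough cost estimate looks like:
price_per_1k = 0.002  # hypothetical $/1K tokens
prompt = "トークンの数え方について教えてください。"
print(f"~${len(enc.encode(prompt)) / 1000 * price_per_1k:.6f} for this prompt")
```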

Only some of the more common Joyo kanji have their own entry in the dictionary of byte-pair-encoding tokens. The rest fall back to multi-token byte representations, without the benefit of BPE compression, costing about as much as their raw Unicode encoding.
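To see that splitting for yourself, a short sketch like the following (the characters and the encoding are arbitrary choices) shows how many tokens each character needs and which bytes each token covers:

```python
# Sketch: inspect how individual characters are encoded. Common kanji and
# kana may be single tokens; rarer kanji fall back to several byte-level
# tokens. "cl100k_base" is an assumed encoding, not necessarily the one
# used by every model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for ch in "梅沢良成あ日":
    ids = enc.encode(ch)
    raw = [enc.decode_single_token_bytes(t) for t in ids]
    print(f"{ch}: {len(ids)} token(s) {ids} -> bytes {raw}")
```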

I hope I’ve been of help (I’ve written this so it translates easily into Japanese).

ChatGPT翻訳: トークンのカウント方法は、インターネット上でトークンプロセッサを使用することで確認できます。以下は1つの例です:https://tiktokenizer.vercel.app/

ソフトウェア内でトークンの使用を見積もるために、openai/tiktokenのようなライブラリモジュールを使用できます:GitHub - openai/tiktoken:tiktokenはOpenAIのモデルと使用するための高速なBPEトークナイザーです。

常用漢字のうち、辞書のバイトペアエンコーディングトークンで独自のトークンとして存在するものは一部です。その他は2バイトの表現で、BPE圧縮の利点を受けられない、Unicodeと同様です。

お役に立てれば幸いです。[260 tokens]

Reference: Japanese kana and word fragments that are their own tokens 【あ-て】 (one way to generate such a list is sketched after it):

[99644, 3, ‘】,【’]
[20146, 2, ‘】【’]
[56041, 1, ‘〜’]
[30592, 1, ‘あ’]
[57904, 2, ‘あり’]
[85778, 3, ‘ありが’]
[86791, 5, ‘ありがとう’]
[97136, 7, ‘ありがとうござ’]
[16996, 1, ‘い’]
[95605, 2, ‘いう’]
[61690, 3, ‘います’]
[30298, 1, ‘う’]
[58943, 1, ‘え’]
[33335, 1, ‘お’]
[32150, 1, ‘か’]
[55032, 2, ‘から’]
[29296, 1, ‘が’]
[50835, 1, ‘き’]
[47885, 1, ‘く’]
[72316, 4, ‘ください’]
[76623, 1, ‘け’]
[22958, 1, ‘こ’]
[51331, 2, ‘この’]
[85702, 2, ‘これ’]
[69848, 2, ‘こん’]
[87642, 3, ‘こんに’]
[90116, 5, ‘こんにちは’]
[48155, 1, ‘ご’]
[77122, 2, ‘ござ’]
[30814, 1, ‘さ’]
[65317, 2, ‘さい’]
[84390, 2, ‘され’]
[98370, 2, ‘さん’]
[75695, 1, ‘ざ’]
[15025, 1, ‘し’]
[67017, 2, ‘しか’]
[80674, 3, ‘しかし’]
[56052, 2, ‘した’]
[39927, 2, ‘して’]
[78435, 3, ‘します’]
[100205, 1, ‘じ’]
[17664, 1, ‘す’]
[54927, 2, ‘する’]
[72343, 1, ‘せ’]
[27930, 1, ‘そ’]
[80540, 3, ‘そして’]
[58427, 2, ‘その’]
[77694, 2, ‘それ’]
[28714, 1, ‘た’]
[90621, 2, ‘ただ’]
[36786, 1, ‘だ’]
[70898, 3, ‘ださい’]
[43515, 1, ‘ち’]
[86615, 2, ‘ちは’]
[42892, 1, ‘っ’]
[76948, 2, ‘って’]
[59740, 1, ‘つ’]
[38145, 1, ‘て’]
[16557, 1, ‘で’]
[38642, 2, ‘です’]
[77182, 2, ‘では’]
[72662, 2, ‘でも’]
[19733, 1, ‘と’]
[78700, 2, ‘とう’]
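A list like the one above can be produced by walking the whole BPE vocabulary and keeping only the entries that decode to pure hiragana. The sketch below assumes the cl100k_base encoding and a simple hiragana-only filter, so the IDs it prints may not match the list above exactly:

```python
# Sketch: enumerate BPE vocabulary entries that decode to pure hiragana,
# producing [token_id, character_count, string] rows like those above.
# The encoding choice is an assumption; other encodings give other IDs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def is_hiragana(s: str) -> bool:
    return bool(s) and all("\u3040" <= c <= "\u309f" for c in s)

rows = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # gaps in the ID space or partial UTF-8 byte tokens
    if is_hiragana(text):
        rows.append((token_id, len(text), text))

for token_id, length, text in sorted(rows, key=lambda r: r[2]):
    print(f"[{token_id}, {length}, '{text}']")
```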


Welcome to the community @y.umezawa

The current tokenizer is optimized for English, which is why using it on non-English languages results in a higher token count than you might expect.

The models (Transformers) have fundamentally been trained on bytes. Hence, the tokenizer’s job is to encode every possible sequence of bytes into tokens before they are sent to the model. This criterion for token counting therefore cannot be arbitrarily changed.
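A quick way to see that effect is to compare roughly equivalent English and Japanese sentences; the sketch below uses an assumed encoding and arbitrary example sentences:

```python
# Sketch: compare token counts for roughly equivalent English and Japanese
# sentences. Encoding name and sentences are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("Thank you very much.", "ありがとうございます。"),
    ("How are tokens counted for Japanese text?", "日本語のテキストではトークンはどのように数えられますか?"),
]
for en, ja in pairs:
    print(f"EN: {len(en):3d} chars -> {len(enc.encode(en)):3d} tokens | "
          f"JA: {len(ja):3d} chars -> {len(enc.encode(ja)):3d} tokens")
```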

The rest has been well covered by both @Foxalabs and @_j


Thank you.
I understand it well (^^)

Thank you!
I understood it well (^^)
