Is there any chance we could get our hands on the GPT-4 Japanese-trained model?
We already have access to the GPT-4 API, but we are not a local Japanese business. We do, however, run a localized Japanese website at https://www.travelmyth.jp/ and would like to use the Japanese-trained GPT-4 model there.
Could you rewrite your response as an English translation, please?
Hello!
GPT-4 is a multilingual model that supports many languages, including Japanese. Therefore, even if your business is not based in Japan, you can still use GPT-4 for your Japanese website, https://www.travelmyth.jp/.
Since you are already using the GPT-4 API, you can use it to generate Japanese text and for other functionality simply by sending Japanese input to the API. You are probably already familiar with how to use the API, but make sure the input data is correctly formatted Japanese so that it is processed appropriately.
If you need specific implementation advice or tips for improving performance in Japanese, please feel free to ask. As a support team, we may be able to offer advice to help you achieve better results.
No “Japanese-tuned” model has ever been discussed by OpenAI, and one would be mostly unnecessary.
Hi @_j - there actually is a dedicated Japanese model. It was announced last week.
We are releasing a GPT-4 custom model optimized for the Japanese language which offers improved performance in Japanese text and operates up to 3x faster than GPT-4 Turbo.
“we’re providing local businesses with early access to a GPT-4 custom model”
More of their commitment to work with “partners” instead of turning on stuff in the API for us.
Since the highlighted feature is an improved token production rate, it is possible the optimization is the model being retrained on a different BPE token encoder, reducing the amount of multibyte Unicode required. Tiktoken doesn’t reveal any support for such an encoder.
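If you want to poke at this yourself, here is a minimal sketch using the tiktoken Python package; the sample sentence is just an arbitrary phrase of mine, and nothing below assumes a Japanese-specific encoder actually exists:

```python
# pip install tiktoken
import tiktoken

# Every encoder tiktoken currently ships -- no Japanese-specific entry,
# only the general-purpose BPE vocabularies.
print(tiktoken.list_encoding_names())

# Tokenize an arbitrary Japanese sentence with the GPT-4 / GPT-4 Turbo encoder.
enc = tiktoken.get_encoding("cl100k_base")
text = "日本語のトークン化を確認します。"
tokens = enc.encode(text)
print(f"{len(text)} characters -> {len(tokens)} tokens")
```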
It seems there is no application page for it. I was hoping users from Japan could apply to test it. They are probably already working with select users, companies, and institutions.
For tokenization, Japanese tokenizes at the kanji level, then at the syllable level for the parts written or spoken in hiragana or katakana. The natural subword unit in Japanese should be the phonetic one. Instead of getting about 1.3 tokens per word as in English, you get closer to 2. You could maybe cut the token count by combining common tokens that appear together, like MASU, DESU, etc., but I am not sure that is a winning combination.
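As a rough sanity check on those ratios, here is a small sketch with tiktoken; the sentences are arbitrary examples of mine, and the Japanese side is reported per character because a real per-word figure would need a morphological analyzer such as MeCab:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Thank you very much for visiting our hotel today."
japanese = "本日は当ホテルにお越しいただき誠にありがとうございます。"

en_tokens = enc.encode(english)
ja_tokens = enc.encode(japanese)

# English words can be counted by whitespace.
print("English:", round(len(en_tokens) / len(english.split()), 2), "tokens per word")

# Japanese has no spaces, so report tokens per character instead.
print("Japanese:", round(len(ja_tokens) / len(japanese), 2), "tokens per character")
```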
I’m not sure OpenAI is being totally honest with the example in their Japanese blog post. With GPT-4-Turbo, you can go directly from English to translated Japanese text, but you will get a shorter output with no references to temperature, etc. If you first answer the question in English, the text is close to the same structure as their Japanese text. If you then instruct GPT-4-Turbo to translate to Japanese as a helpful, native-Japanese-speaking tour guide, you get something fairly close to their sample. If you tell it to translate as the concierge at the Imperial Hotel, you get the honorific forms, which is quite funny. If you use GPT-4o, you get a slightly shorter version. I think there is a lot of smoke and mirrors going on here. I wonder how much is due to different system prompts in the two models.
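Roughly, the two-step version of that experiment looks like the sketch below (openai Python client); the question and prompts are stand-ins for what I describe above rather than anything taken from the blog post, and model availability may differ by account:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What should a first-time visitor know before going to an onsen?"

# Step 1: answer the question in English first.
english = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# Step 2: translate that answer, with the persona set in the system prompt.
japanese = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful, native Japanese-speaking tour guide. "
                       "Translate the user's text into natural Japanese.",
        },
        {"role": "user", "content": english},
    ],
).choices[0].message.content

print(japanese)
```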
It is likely the “Japanese model” is, or is closely related to, gpt-4o, which uses a token encoding dictionary twice the size. That could allow for complete hard-coded coverage of the Joyo kanji, along with more direct Unicode coverage of characters in many more languages.
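The size difference is easy to confirm with tiktoken; treating o200k_base as the gpt-4o encoder is based on tiktoken’s own model table, and the sample string below is just an arbitrary kanji phrase:

```python
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
o200k = tiktoken.get_encoding("o200k_base")    # gpt-4o, per tiktoken's model table

print("cl100k_base vocab size:", cl100k.n_vocab)
print("o200k_base vocab size:", o200k.n_vocab)

# Fewer tokens for the same kanji under the larger vocabulary would be
# consistent with more kanji having their own entries.
sample = "常用漢字表"
print("cl100k_base:", len(cl100k.encode(sample)), "tokens")
print("o200k_base:", len(o200k.encode(sample)), "tokens")
```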
No such thing as a phonetic subword exists in computer science, except within some mora lookup tools.
Tokenization uses byte-pair encoding, iteratively merging “popular” byte sequences that appear next to each other in the training data (or those placed manually). Japanese produced by the AI is output as UTF-8 Unicode, using two to four bytes per character, and those bytes have no relation to the glyph’s roots or pronunciation, on-yomi or kun-yomi. ‘desu’ is already a single token, along with other words, but extending the merges into Chinese characters when tokenizing can give more compression and more semantics in training, instead of a single character taking three tokens to write…
If you want to test the emergent learning of the AI, try sending Shift-JIS bytes to it…
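A quick way to see both the UTF-8 byte cost per character and what Shift-JIS bytes actually look like before you paste them into a prompt is plain Python; the strings are just examples, and decoding the Shift-JIS bytes as latin-1 is only one way to smuggle them into a text field:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Each kana or kanji is three UTF-8 bytes, but token counts vary:
# common characters and words get merged, rare ones fall back toward raw bytes.
for s in ["です", "元", "鬱"]:
    print(s, len(s.encode("utf-8")), "bytes ->", len(enc.encode(s)), "tokens")

# The same text as Shift-JIS bytes; to paste it into a prompt you would have
# to present it as mojibake, e.g. by decoding the bytes as latin-1.
sjis = "元気です".encode("shift_jis")
print(sjis)
print(sjis.decode("latin-1"))
```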
Thanks. I guess I was assuming it would follow something along the lines of the Nelson for character lookup when building a tokenizer. I’ll have to play and see with stuff like 元気です and げんきです.
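Something like this is what I have in mind for playing with it; decoding each token back to its bytes shows exactly where the splits land (tiktoken assumed here):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["元気です", "げんきです"]:
    ids = enc.encode(text)
    # decode_single_token_bytes shows which bytes each token covers,
    # including tokens that are only a fragment of a multibyte character.
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(text, "->", ids, pieces)
```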
Surprisingly, “げ”, a single kana, is two tokens in cl100k and not its own token. The only manual preparation seems to have been all numbers 0-999 being single unjoinable tokens, made by a joining rule and not by pre-insertion into a table, and the initial 256 byte tokens being reorganized into ASCII order. Everything else is corpus training.
Still, “desu” is the only “word” in your sample in either encoder.
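For what it’s worth, here is the same check against both encoders; calling o200k_base the second encoder assumes it is what gpt-4o uses, and the number is included only to show the 0-999 behaviour:

```python
import tiktoken

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    for s in ["げ", "です", "123"]:
        print(name, repr(s), "->", len(enc.encode(s)), "token(s)")
```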
It is not a “rule” and is not based on the underlying encoding. じ, for example, is token 100204. With the token number order being something like a popularity contest, and with voiced hiragana landing near the end, it is just likely that the preponderance of other corpus words pushed some Japanese characters out of the dictionary of possible byte pairings. It is just odd that 100208:“_softmax” ranked alongside another language’s alphabet.
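Anyone curious can read those entries back directly; this just prints whatever cl100k_base actually stores at those positions, so if my numbers above are off it will show that too:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The raw bytes stored for a few late token ids, near the end of the
# ~100k vocabulary where the least "popular" merges end up.
for token_id in [100204, 100208]:
    print(token_id, enc.decode_single_token_bytes(token_id))

# And the forward direction: which token id(s) "じ" actually encodes to.
print(enc.encode("じ"))
```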