Is there any chance we could get our hands on the GPT-4 Japanese-trained model?
We already have access to the GPT-4 API, but we are not a local Japanese business. We do, however, run a localized Japanese website at https://www.travelmyth.jp/ and would like to use the Japanese-trained GPT-4 model there.
Could you rewrite your response as an English translation, please?
Hello!
GPT-4 is a multilingual model that supports many languages, including Japanese. Therefore, even if your business is not based in Japan, you can still use GPT-4 for your Japanese website, https://www.travelmyth.jp/.
Since you are already using the GPT-4 API, you can use it to generate Japanese text and for other functionality simply by sending Japanese input to the API. You are probably already familiar with how to use the API, but make sure the input data is correctly formatted Japanese so that it is processed appropriately.
If you need specific implementation advice or tips for improving performance in Japanese, please feel free to ask. As a support team, we may be able to offer advice to help you achieve better results.
No “Japanese-tuned” model has ever been discussed by OpenAI, and one would be mostly unnecessary.
Hi @_j - there actually is a dedicated Japanese model. It was announced last week.
We are releasing a GPT-4 custom model optimized for the Japanese language which offers improved performance in Japanese text and operates up to 3x faster than GPT-4 Turbo.
“we’re providing local businesses with early access to a GPT-4 custom model”
More of their commitment to work with “partners” instead of turning on stuff in the API for us.
Since the highlighted feature is an improved token production rate, it is possible the optimization is the model being retrained on a different BPE token encoder, reducing the amount of multibyte Unicode required. Tiktoken doesn’t reveal any support for such an encoder.
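If you want to poke at this yourself, here is a minimal sketch using the tiktoken Python package; the sample sentence is just an arbitrary phrase of mine, and nothing below assumes a Japanese-specific encoder actually exists:

```python
# pip install tiktoken
import tiktoken

# Every encoder tiktoken currently ships -- no Japanese-specific entry,
# only the general-purpose BPE vocabularies.
print(tiktoken.list_encoding_names())

# Tokenize an arbitrary Japanese sentence with the GPT-4 / GPT-4 Turbo encoder.
enc = tiktoken.get_encoding("cl100k_base")
text = "日本語のトークン化を確認します。"
tokens = enc.encode(text)
print(f"{len(text)} characters -> {len(tokens)} tokens")
```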
It seems there is no application page for it. I was hoping users from Japan could apply to test it. They are probably already working with select users, companies, and institutions.
For tokenization, Japanese tokenizes at the kanji level, then at the syllable level for the parts written or spoken in hiragana or katakana. The natural subword unit in Japanese should be the phonetic one. Instead of getting about 1.3 tokens per word as in English, you get closer to 2. You could maybe cut the token count by combining common tokens that appear together, like MASU, DESU, etc., but I am not sure that is a winning combination.
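As a rough sanity check on those ratios, here is a small sketch with tiktoken; the sentences are arbitrary examples of mine, and the Japanese side is reported per character because a real per-word figure would need a morphological analyzer such as MeCab:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Thank you very much for visiting our hotel today."
japanese = "本日は当ホテルにお越しいただき誠にありがとうございます。"

en_tokens = enc.encode(english)
ja_tokens = enc.encode(japanese)

# English words can be counted by whitespace.
print("English:", round(len(en_tokens) / len(english.split()), 2), "tokens per word")

# Japanese has no spaces, so report tokens per character instead.
print("Japanese:", round(len(ja_tokens) / len(japanese), 2), "tokens per character")
```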
I’m not sure OpenAI is being totally honest with the example in their Japanese blog post. With GPT-4-Turbo, you can go directly from English to translated Japanese text, but you will get a shorter output with no references to temperature, etc. If you first answer the question in English, the text is close to the same structure as their Japanese text. If you then instruct GPT-4-Turbo to translate to Japanese as a helpful, native-Japanese-speaking tour guide, you get something fairly close to their sample. If you tell it to translate as the concierge at the Imperial Hotel, you get the honorific forms, which is quite funny. If you use GPT-4o, you get a slightly shorter version. I think there is a lot of smoke and mirrors going on here. I wonder how much is due to different system prompts in the two models.
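Roughly, the two-step version of that experiment looks like the sketch below (openai Python client); the question and prompts are stand-ins for what I describe above rather than anything taken from the blog post, and model availability may differ by account:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What should a first-time visitor know before going to an onsen?"

# Step 1: answer the question in English first.
english = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# Step 2: translate that answer, with the persona set in the system prompt.
japanese = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful, native Japanese-speaking tour guide. "
                       "Translate the user's text into natural Japanese.",
        },
        {"role": "user", "content": english},
    ],
).choices[0].message.content

print(japanese)
```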
It is likely the “Japanese model” is, or is closely related to, gpt-4o, which uses a token encoding dictionary twice the size. That could allow for complete hard-coded coverage of the Joyo kanji, along with more direct Unicode coverage of characters in many more languages.
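The size difference is easy to confirm with tiktoken; treating o200k_base as the gpt-4o encoder is based on tiktoken’s own model table, and the sample string below is just an arbitrary kanji phrase:

```python
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
o200k = tiktoken.get_encoding("o200k_base")    # gpt-4o, per tiktoken's model table

print("cl100k_base vocab size:", cl100k.n_vocab)
print("o200k_base vocab size:", o200k.n_vocab)

# Fewer tokens for the same kanji under the larger vocabulary would be
# consistent with more kanji having their own entries.
sample = "常用漢字表"
print("cl100k_base:", len(cl100k.encode(sample)), "tokens")
print("o200k_base:", len(o200k.encode(sample)), "tokens")
```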
No such thing as a phonetic subword exists in computer science, except within some mora lookup tools.
Tokenization uses byte-pair encoding, iteratively merging “popular” byte sequences that appear next to each other in the training data (or those placed manually). Japanese produced by the AI is output as UTF-8 Unicode, using two to four bytes per character, and those bytes have no relation to the glyph’s roots or pronunciation, on-yomi or kun-yomi. ‘desu’ is already a single token, along with other words, but extending the merges into Chinese characters when tokenizing can give more compression and more semantics in training, instead of a single character taking three tokens to write…
If you want to test the emergent learning of the AI, try sending Shift-JIS bytes to it…
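A quick way to see both the UTF-8 byte cost per character and what Shift-JIS bytes actually look like before you paste them into a prompt is plain Python; the strings are just examples, and decoding the Shift-JIS bytes as latin-1 is only one way to smuggle them into a text field:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Each kana or kanji is three UTF-8 bytes, but token counts vary:
# common characters and words get merged, rare ones fall back toward raw bytes.
for s in ["です", "元", "鬱"]:
    print(s, len(s.encode("utf-8")), "bytes ->", len(enc.encode(s)), "tokens")

# The same text as Shift-JIS bytes; to paste it into a prompt you would have
# to present it as mojibake, e.g. by decoding the bytes as latin-1.
sjis = "元気です".encode("shift_jis")
print(sjis)
print(sjis.decode("latin-1"))
```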
Thanks. I guess I was assuming it would follow something along the lines of the Nelson for character lookup when building a tokenizer. I’ll have to play and see with stuff like 元気です and げんきです.
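Something like this is what I have in mind for playing with it; decoding each token back to its bytes shows exactly where the splits land (tiktoken assumed here):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["元気です", "げんきです"]:
    ids = enc.encode(text)
    # decode_single_token_bytes shows which bytes each token covers,
    # including tokens that are only a fragment of a multibyte character.
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(text, "->", ids, pieces)
```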
Surprisingly, “げ”, a single kana, is two tokens in cl100k and not its own token. The only manual preparation seems to have been all numbers 0-999 being single unjoinable tokens, made by a joining rule and not by pre-insertion into a table, and the initial 256 byte tokens being reorganized into ASCII order. Everything else is corpus training.
Still, “desu” is the only “word” in your sample in either encoder.
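For what it’s worth, here is the same check against both encoders; calling o200k_base the second encoder assumes it is what gpt-4o uses, and the number is included only to show the 0-999 behaviour:

```python
import tiktoken

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    for s in ["げ", "です", "123"]:
        print(name, repr(s), "->", len(enc.encode(s)), "token(s)")
```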
It is not a “rule” and is not based on the underlying encoding. じ, for example, is token 100204. With the token number order being something like a popularity contest, and with voiced hiragana landing near the end, it is just likely that the preponderance of other corpus words pushed some Japanese characters out of the dictionary of possible byte pairings. It is just odd that 100208:“_softmax” ranked alongside another language’s alphabet.
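Anyone curious can read those entries back directly; this just prints whatever cl100k_base actually stores at those positions, so if my numbers above are off it will show that too:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The raw bytes stored for a few late token ids, near the end of the
# ~100k vocabulary where the least "popular" merges end up.
for token_id in [100204, 100208]:
    print(token_id, enc.decode_single_token_bytes(token_id))

# And the forward direction: which token id(s) "じ" actually encodes to.
print(enc.encode("じ"))
```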